  1. Mar 2026
    1. Reviewer #1 (Public review):

      Summary:

      The authors provide a simple yet elegant approach to identifying therapeutic targets that synergize to prevent therapeutic resistance using cell lines, data-independent acquisition proteomics, and bioinformatic analysis. The authors identify several combinations of pharmaceuticals that were able to overcome or prevent therapeutic resistance in culture models of ovarian cancer, a disease with an unmet diagnostic and therapeutic need.

      Strengths:

      The manuscript utilizes state-of-the-art proteomic analysis, entailing data-independent acquisition methods, an approach that maximizes the robustness of identified proteins across cell lines. The authors focus their analysis on several drugs under development for the treatment of ovarian cancer and utilize straightforward thresholds for identifying proteomic adaptations across several drugs on the OVSAHO cell line. The authors utilized three independent and complementary approaches to predicting drug synergy (NetBox, GSEA, and Manual Curation). The drug combination with the most robust synergy across multiple cell lines was the inhibition of MEK and CDK4/6 using PD-0325901+Palbociclib, respectively. Additional combinations were also identified, including PARPi (rucaparib) and the fatty acid synthase inhibitor (TVB-2640). Collectively, this study provides important insight and exemplifies a solid approach to identifying drug synergy without large drug library screens.

      Weaknesses:

      The manuscript supports their findings by describing the biological function(s) of targets using referenced literature. While this is valuable, the number of downstream targets for each initial target is extensive; thus, the current work does not attempt to elucidate the mechanism of their drug synergy. Responses to drugs are quantified 72 hours after treatment and exclusively focused on cell viability and protein expression levels. The discovery phase of experimentation was solely performed on the OVSAHO cell line. An additional cell line(s) would increase the impact of how the authors went about identifying synergistic targets using bioinformatics. Ovarian cancer is elusive to treatment as primary cancer will form spheroids within ascites/peritoneal fluids in a state of pseudo-senescence to overcome environmental stress. The current manuscript is executed in 2D culture, which has been demonstrated to deviate from 3D, PDX, and primary tumours in terms of therapeutic resistance (DOI: 10.3390/cancers13164208). Collectively, the manuscript is insufficient in providing additional mechanistic insight beyond the literature, and its interpretation of data is limited to 2D culture until further validated.

      Comments on revisions:

      The reviewer has no further recommendations for the authors.

    2. Reviewer #2 (Public review):

      Summary:

      Franz and colleagues combined proteomics analysis of OVSAHO cell lines treated with 6 individual drugs. The quantitative proteomics data were then used for computational analysis to identify candidates/modules that could be used to predict combination treatments for specific drugs.

      Strengths:

      The authors present solid proteomics data and computational analysis to effectively repeat at the proteomics level analyses that have previously been done predominantly with transcriptional profiling. Since most drugs either target proteins and/or proteins are the functional units of cells, this makes intuitive sense.

      Weaknesses:

      Considering the available resources of the involved teams, performing the initial analysis in a single HGSC cell line is certainly a weakness/limitation. During the revision, additional cell lines were used for verification.

      The data also show how challenging it is to correctly predict drug combinations. In Table 2 (if I read it correctly), the majority of the drug combinations predicted for the initial cell line OVSAHO did not result in the predicted effect. It also shows how variable the response was in the different HGSC cell lines used for combination treatment. The success rate will most likely continue to drop as more sophisticated models are being used (e.g., PDX). Human patients are even more challenging.

      It would most likely be useful to more directly mention/discuss these caveats in the manuscript. This was added to the discussion during the revision. Overall the authors have responded to previous suggestions.

    3. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      The authors provide a simple yet elegant approach to identifying therapeutic targets that synergize to prevent therapeutic resistance using cell lines, data-independent acquisition proteomics, and bioinformatic analysis. The authors identify several combinations of pharmaceuticals that were able to overcome or prevent therapeutic resistance in culture models of ovarian cancer, a disease with an unmet diagnostic and therapeutic need.

      Strengths:

      The manuscript utilizes state-of-the-art proteomic analysis, entailing data-independent acquisition methods, an approach that maximizes the robustness of identified proteins across cell lines. The authors focus their analysis on several drugs under development for the treatment of ovarian cancer and utilize straightforward thresholds for identifying proteomic adaptations across several drugs on the OVSAHO cell line. The authors utilized three independent and complementary approaches to predicting drug synergy (NetBox, GSEA, and Manual Curation). The drug combination with the most robust synergy across multiple cell lines was the inhibition of MEK and CDK4/6 using PD-0325901+Palbociclib, respectively. Additional combinations were also identified, including PARPi (rucaparib) and the fatty acid synthase inhibitor (TVB-2640). Collectively, this study provides important insight and exemplifies a solid approach to identifying drug synergy without large drug library screens.

      Weaknesses:

      The manuscript supports their findings by describing the biological function(s) of targets using referenced literature. While this is valuable, the number of downstream targets for each initial target is extensive; thus, the current work does not attempt to elucidate the mechanism of their drug synergy. Responses to drugs are quantified 72 hours after treatment and exclusively focused on cell viability and protein expression levels. The discovery phase of experimentation was solely performed on the OVSAHO cell line. An additional cell line(s) would increase the impact of how the authors went about identifying synergistic targets using bioinformatics. Ovarian cancer is elusive to treatment as primary cancer will form spheroids within ascites/peritoneal fluids in a state of pseudo-senescence to overcome environmental stress. The current manuscript is executed in 2D culture, which has been demonstrated to deviate from 3D, PDX, and primary tumours in terms of therapeutic resistance (DOI: 10.3390/cancers13164208). Collectively, the manuscript is insufficient in providing additional mechanistic insight beyond the literature, and its interpretation of data is limited to 2D culture until further validated.

      We appreciate your positive remarks on the use of NetBox, GSEA, and human curation for predicting anti-resistance effects of second drugs. Regarding the weaknesses you identified:

      Mechanistic Insight: We agree that our current work interprets findings using prior published knowledge and does not attempt to infer detailed mechanisms of drug resistance of the nominated drug combinations. Our primary goal with this study was to establish a robust, unbiased proteomic and computational pipeline for proposing anti-resistance drug combinations, rather than to fully characterize the downstream molecular effects for each combination or to prove causation. To get closer to mechanistic insight, meaning detailed hypotheses of causative interactions, one would need to investigate anti-resistance effects in other pre-clinical materials as a crucial next step for the most promising combinations identified. This was out of scope for us. We assume the proposed combinations are useful for focussed follow-up in the community.

      Discovery Phase on a Single Cell Line: Our discovery phase was focused solely on the OVSAHO cell line due to its resemblance to surgical ovarian cancer samples. Including additional cell lines in the initial proteomic-response discovery phase plausibly would have enhanced the generalizability. But this was not done due to resource constraints. However, we did perform more extensive validation of the effect of drug combinations on proliferation in several cell lines to explore broader applicability.

      2D Culture Limitations: We are fully aware of the limitations of 2D cell culture models, especially in the context of ovarian cancer, where in clinical reality interactions with the microenvironment and other effects can have significant roles in therapeutic resistance. And we recognize that in lab experiments 2D culture does not fully recapitulate the complexities of 3D tumors, PDX models, or primary patient tumors. We have added citations to the relevant literature (including the reference you provided), and have emphasized in the Discussion that our findings serve as a strong foundation for future experimental tests (validation) in more physiologically relevant experimental model systems.

      Reviewer #2 (Public review):

      Summary:

      Franz and colleagues combined proteomics analysis of OVSAHO cell lines treated with 6 individual drugs. The quantitative proteomics data were then used for computational analysis to identify candidates/modules that could be used to predict combination treatments for specific drugs.

      Strengths:

      The authors present solid proteomics data and computational analysis to effectively repeat at the proteomics level analyses that have previously been done predominantly with transcriptional profiling. Since most drugs either target proteins and/or proteins are the functional units of cells, this makes intuitive sense.

      Weaknesses:

      Considering the available resources of the involved teams, performing the initial analysis in a single HGSC cell line is certainly a weakness/limitation.

      The data also show how challenging it is to correctly predict drug combinations. In Table 2 (if I read it correctly), the majority of the drug combinations predicted for the initial cell line OVSAHO did not result in the predicted effect. It also shows how variable the response was in the different HGSC cell lines used for the combination treatment. The success rate will most likely continue to drop as more sophisticated models are being used (e.g., PDX). Human patients are even more challenging.

      It would most likely be useful to more directly mention/discuss these caveats in the manuscript.

      Thank you for your summary and positive comments. Regarding the weaknesses you identified:

      Initial Analysis in a Single Cell Line: We concur with your assessment that performing the initial analysis in a single HGSC cell line (OVSAHO) is a limitation. As mentioned in our response to Reviewer #1, this decision was driven by resource limitations, and we acknowledge that a broader initial screen would have strengthened generalizability. We added this limitation in the Discussion section, emphasizing the use of diverse cell lines in the initial protein response profiling as an area for future work.

      Challenges in Predicting Drug Combinations and Variability: We thank the reviewer for the observation regarding the challenges in predicting the effect of drug combinations and the variability of antiproliferative effects observed in different HGSC cell lines (Table 2). As with any predictive method, our computational-experimental pipeline is not guaranteed to identify with absolute certainty additive or synergistic interactions, but generates data-informed hypotheses to be considered in the presence of other available observations. We now emphasize in the Discussion that while our computational pipeline provides plausible anti-resistance candidates, the precise results (extent of additivity or synergy) differ in different cell lines. This underscores that experimental validation across diverse physiological models, such as PDXs or organoids (not just additional cell lines), is an essential criterion of validity of the generated hypotheses. And we underscore the (obvious) challenge of the ultimate translation of pre-clinical experiments to therapeutic effects in humans.

      In revision, we have clarified in detail the expectation of predicted synergy implied by the reviewer’s comment, “the majority of the drug combinations predicted for the initial cell line OVSAHO did not result in the predicted effect”. This reflects a misunderstanding of our goals. The predictions are for drug effects that are anti-resistant, such that the proteomic response to one drug is counteracted by the second drug. The predicted effect is not synergy. Indeed, a useful anti-resistance effect does not require synergy; additivity is sufficient: if cells are resistant to the original drug, the second drug plausibly still has an antiproliferative effect, as it targets the cellular processes that are increased in activity (upregulated) in response to the first drug. So we deleted the red synergy color in Table 2 to avoid the potential conclusion from our results that without synergy, there is no benefit to a drug combination. In fact, additive drug combination effects are in themselves beneficial. For clarity on this point, we added coloring in Table 2 to highlight the small number of combinations that did not work well in that the combination was clearly antagonistic, using a combination index CI >= 2.0 cutoff; we clarify this point in the Discussion.
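
      The distinction drawn above between synergy, additivity, and antagonism can be illustrated with a small sketch. It assumes the Loewe-additivity form of the combination index (CI = d1/D1 + d2/D2), which may differ from the exact definition used in the manuscript. Only the CI >= 2.0 antagonism cutoff comes from the response above; the function names, the tolerance band around 1.0, and all dose values are hypothetical.

```python
def combination_index(d1, d2, D1, D2):
    """Loewe-additivity combination index (assumed form, not taken from the manuscript).

    d1, d2: doses of each drug used together to reach a given effect.
    D1, D2: doses of each drug alone that reach the same effect.
    """
    if D1 <= 0 or D2 <= 0:
        raise ValueError("single-agent doses must be positive")
    return d1 / D1 + d2 / D2


def classify(ci, antagonism_cutoff=2.0, tolerance=0.1):
    """Label a combination; the tolerance band around 1.0 is an illustrative assumption."""
    if ci >= antagonism_cutoff:
        return "clearly antagonistic"
    if ci < 1.0 - tolerance:
        return "synergistic"
    if ci <= 1.0 + tolerance:
        return "additive"
    return "weakly antagonistic"


# Hypothetical doses: each drug is used at half of its single-agent dose.
ci = combination_index(d1=2.0, d2=4.0, D1=4.0, D2=8.0)
print(ci, classify(ci))
```

      Under this labelling, an additive combination still counts as useful, and only combinations at or above the antagonism cutoff are flagged.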

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      (1) Figure 2b. This figure would be more impactful if presented as an upset plot with the same Venn diagram embedded. I am not sure Figure 2c accurately supports the statement: "Frequently affected proteins generally had expression level changes in the same direction across all drug perturbations (Figure 2c), indicating a potential general stress response." It would be beneficial if the authors could present the data in a way that shows the number of genes with similar directional groupings. Likewise, the color scheme for this figure is hard to interpret as grey is the most negative value and values are preselected for absolute fold-change. Please consider colors with a stronger contrast.

      Authors should consider uploading MS files to the PRIDE or MASSIVE repository.

      We have addressed these very useful suggestions. We have edited Figure 2b to include the requested upset plot. It serves to illustrate the intersection of proteins responding to different perturbation conditions; due to figure space constraints, we limit the figure to entries with counts of at least 15. We have added the number of proteins with consistent directional changes in the Figure 2c caption and the text.
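
      The counts behind an upset plot of this kind can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the condition labels, and the protein identifiers are hypothetical, and `min_count` stands in for the "at least 15" display threshold mentioned above.

```python
from collections import Counter


def exclusive_intersections(sets_by_condition, min_count=1):
    """Count items by the exact combination of conditions in which they appear."""
    membership = {}
    for condition, items in sets_by_condition.items():
        for item in items:
            membership.setdefault(item, set()).add(condition)
    counts = Counter(frozenset(conds) for conds in membership.values())
    # Keep only the combinations that meet the display threshold.
    return {tuple(sorted(combo)): n for combo, n in counts.items() if n >= min_count}


# Hypothetical responding proteins per drug perturbation.
responders = {
    "drugA": {"P1", "P2", "P3"},
    "drugB": {"P2", "P3", "P4"},
    "drugC": {"P3"},
}
print(exclusive_intersections(responders))
```

      Each key is one bar of the upset plot (one exact combination of conditions), and raising `min_count` drops the sparse combinations.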

      For Figure 2c, we have edited the color bar legend to better reflect the colors that appear in the heatmap.

      We have added our mass-spectrometry drug-response dataset to the ProteomeXchange Consortium via PRIDE with accession number PXD066316.

    1. eLife Assessment

      This valuable computational study presents a conceptually simple and biologically plausible reinforcement-learning framework for motor learning based on policy-gradient methods. The evidence supporting the conclusions is convincing, including rigorous mathematical derivations of learning rules for the mean and variance of motor commands and simulation results for three sets of experimental data, based on three different motor learning tasks from the literature. However, there is a lack of a clear description of the specific conditions under which this framework yields unique mechanistic insights or predictive values, hence falling short of qualifying as a "general theory of motor learning". The work will be of interest to researchers in computational motor learning and motor neuroscience.

    2. Reviewer #1 (Public review):

      Summary:

      This study proposes a simple and universal reinforcement-learning framework for understanding learning in complex motor tasks. Central to the framework is a policy-gradient algorithm, in which motor commands are updated not via the gradient of the reward with respect to policy parameters, but via the gradient of the policy itself, scaled by reward information. The authors demonstrate that this scheme can reproduce learning dynamics that have been reported in previous empirical studies.
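
      The update described in this summary can be sketched for the simplest case. The sketch assumes a one-dimensional Gaussian policy with a fixed standard deviation and a reward equal to the negative squared error; it is not the manuscript's equation 4, and all names and parameter values are hypothetical. The learner uses only the scalar reward and the score of the policy, (action - mu) / sigma^2, never the gradient of the reward itself.

```python
import random


def reinforce_mean_step(mu, sigma, action, reward, baseline, lr):
    """One REINFORCE update of the mean of a 1-D Gaussian policy."""
    score = (action - mu) / sigma ** 2  # d/dmu of log N(action; mu, sigma^2)
    return mu + lr * (reward - baseline) * score


def simulate(target=1.0, mu=0.0, sigma=0.3, lr=0.02, trials=2000, seed=0):
    """Adapt the mean toward `target` from scalar reward feedback only."""
    rng = random.Random(seed)
    baseline = 0.0
    for _ in range(trials):
        action = rng.gauss(mu, sigma)        # noisy motor command
        reward = -(action - target) ** 2     # assumed reward function
        mu = reinforce_mean_step(mu, sigma, action, reward, baseline, lr)
        baseline += 0.1 * (reward - baseline)  # running-average reward baseline
    return mu


print(simulate())
```

      In expectation this step follows the gradient of the expected reward with respect to the mean, which is why exploration noise is enough to drive learning.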

      Strengths:

      The key contribution of this study lies in its application of a policy-gradient algorithm to describe motor learning processes. This idea is biologically plausible, as computing the gradient of the policy with respect to its parameters is likely to be substantially easier for the nervous system than computing the gradient of the reward with respect to policy parameters. The authors present three representative examples showing that this scheme can capture several aspects of motor learning dynamics. Notably, providing such a unified description across different tasks has been difficult for conventionally proposed learning frameworks, such as supervised learning.

      Weaknesses:

      While this scheme is valuable in that it captures certain aspects of learning dynamics, I find that its overall significance is limited for the following reasons.

      (1) The empirical results examined in this study primarily demonstrate that motor learning drives performance toward the spatial task goal while reducing variability. Given that the policies are expressed using Gaussian distributions and that their parameters (i.e., the mean and covariance matrix) are updated during learning, it is not surprising that the proposed scheme can reproduce these results by fitting the parameters to the data.

      (2) The proposed framework assumes that the motor learning system relies on the gradient of the policy with respect to its parameters. However, I am not convinced that this assumption is always appropriate, because in all three empirical studies examined here, explicit spatial error information is available. In such cases, the motor learning system could, in principle, compute the gradient of the error with respect to the policy parameters directly, without relying on a policy-gradient mechanism.

      (3) Most importantly, it remains unclear how the proposed scheme advances our understanding of the underlying learning mechanisms beyond providing a descriptive account of the learning process. While the framework offers a compact mathematical description of learning dynamics, it is uncertain how it can yield novel mechanistic insights or testable predictions that distinguish it from existing learning models.

    3. Reviewer #2 (Public review):

      Summary:

      In this study, Haith applies, and to some extent extends, the theoretical framework of policy gradient (PG) and the derived REINFORCE learning rules to human motor learning. This approach is coherent because human motor skill learning is characterized by improvements in both accuracy and precision (the inverse of variance), and REINFORCE provides update rules for both the mean and the variance of the motor commands.

      Weaknesses:

      The mean update (equation 4) is given in task space (i.e., angle and velocity for the skittle task), but the covariance update (equation 5) is given in eigenvector space. This formulation appears to have been provided for computational convenience, as it ensures that the variances are always positive by exponentiating the eigenvalues. However, this eigenspace formulation is somewhat artificial and complex (notably the update rule for the orientation of the covariance matrix) and seems far from biological reality. A simpler alternative, suggested by the author, is to provide the full covariance matrix, including crossed terms, and derive equations to update the diagonal variance terms and the cross-terms (perhaps after a transformation to keep all elements positive if needed). This would provide a simpler and more biologically plausible update to the covariance matrix terms, in the spirit of the original REINFORCE algorithm. The author suggests that he has derived the update rule for the cross terms, so this should be relatively easy to write and update, especially for the skittle learning rules. If the author wishes to keep their rules in simulations, then the two mathematical rules could be presented in the methods or a supplementary material section.

      The discussion about binary rewards and the increase in variance in previous experiments is potentially interesting. However, I do not understand why variance cannot increase with the policy-gradient RL update. Surely, equation 5 can lead to both an increase and a decrease in variance depending on the reward prediction error and the noise (for example, suppose the noise at trial i is small and leads to a smaller reward than the baseline; variance would increase). It would be interesting to see detailed simulation results for the skittle task showing changes in both mean and variance across a few consecutive trials, with both increases and decreases in reward prediction errors. These results could then be compared in simulations with those of a task with discrete binary rewards.
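
      The sign argument in this paragraph can be checked in a one-dimensional sketch. It assumes a scalar Gaussian policy parameterized by log(sigma), which is not the eigenvector-space formulation of the manuscript's equation 5; the function name and numbers are hypothetical. For this parameterization the score is (action - mu)^2 / sigma^2 - 1, so the update direction is the product of the reward prediction error and that term.

```python
import math


def reinforce_log_sigma_step(log_sigma, mu, action, reward, baseline, lr):
    """One REINFORCE update of log(sigma) for a 1-D Gaussian policy."""
    sigma = math.exp(log_sigma)
    z2 = ((action - mu) / sigma) ** 2
    score = z2 - 1.0  # d/d(log sigma) of log N(action; mu, sigma^2)
    return log_sigma + lr * (reward - baseline) * score


# Four sign combinations, starting from sigma = 1 (log_sigma = 0).
cases = {
    "small noise, reward below baseline": (0.1, -1.0),
    "small noise, reward above baseline": (0.1, 1.0),
    "large noise, reward below baseline": (2.0, -1.0),
    "large noise, reward above baseline": (2.0, 1.0),
}
for name, (action, reward) in cases.items():
    new = reinforce_log_sigma_step(0.0, 0.0, action, reward, 0.0, 0.1)
    print(name, "->", "increase" if new > 0.0 else "decrease")
```

      In this sketch the variance grows in two of the four cases, including the reviewer's example of a small perturbation followed by a below-baseline reward.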

      Generalization is a major feature of human learning, but it is not discussed or studied here. In fact, in the de novo task simulations, there can be no generalization because the values are modeled as running averages for each target rather than derived from a critic network. Can the author discuss this point and, ideally, show generalization results in simulations, say in the skittle task?
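
      The structural reason for the lack of generalization can be sketched in a few lines, assuming a table of running-average reward estimates keyed by target; the function name, target labels, and update rate are hypothetical and not taken from the manuscript.

```python
def update_value(values, target_id, reward, rate=0.1):
    """Running-average reward estimate kept separately for each target."""
    old = values.get(target_id, 0.0)
    values[target_id] = old + rate * (reward - old)
    return values[target_id]


values = {}
for _ in range(50):
    update_value(values, "target_A", reward=1.0)

print(values)                # only target_A has an estimate
print("target_B" in values)  # False: nothing transfers to an unpracticed target
```

      A critic that maps target features to a value would share parameters across targets, which is what would allow an estimate for an unpracticed target.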

      The application of the model to reproduce the Shmuelof et al. data is, at the same time, justified (because one of their main results is an improvement in precision, which Policy Gradient directly addresses) and somewhat "forced," as the author approximates curved movements with a series of straight-line movements. The author therefore needs to specify multiple via points with PG updating and a reward function that also enforces smoothness. The justification for the Guigon 2023 model seems somewhat artificial because it mainly applies to slow movements. Can the author comment and discuss alternatives that do not require via points, drawing from the robotics literature if needed (Schaal's Dynamic Movement Primitives come to mind, for example)?

      Policy Gradient requires both a "noisy" and a clean "pass", making it non-biological in its simplest form. Legenstein et al. (2010) and Miconi (2017) provided biologically plausible forms for the mean update. Since Policy Gradient is proposed as a model of human motor learning, can the author discuss the biological plausibility of the proposed learning rules and possible biologically plausible extensions?

    1. eLife Assessment

      This study addresses an important gap in drug discovery by delivering a rigorous, large-scale evaluation of widely used co-folding methods for predicting ligand-bound protein complexes and virtual screening. A key strength is the comprehensive benchmarking framework, which leverages structures and chemical compounds that were absent from the AI models' training sets, thereby providing particularly compelling and unbiased evidence of co-folding performance. The findings clearly delineate the complementary roles of deep learning-based co-folding and physics-based docking, offering practical guidance for their rational integration into drug discovery workflows. Although the conclusions are convincing, improvements in the test cases, presentation, and usability can further strengthen the overall impact.

    2. Reviewer #1 (Public review):

      The authors conducted a comprehensive benchmarking and evaluation of co-folding platforms, including AlphaFold3, Boltz-2, Chai-1, and the docking algorithm DOCK3.7, which employs a physics-based scoring function that incorporates van der Waals interactions, electrostatics, and ligand desolvation energies. The system of interest was the SARS-CoV-2 NSP3 macrodomain (Mac1), an increasingly popular antiviral target, and the ligand sets comprised 557 unseen ligand poses (keeping the training for these co-folding platforms in mind). Additionally, the authors investigated whether the co-folding models could distinguish true ligands from non-binding small molecules. The study is thorough, with extensive statistical support and consensus across multiple metrics (chemoinformatics for quantifying ligand similarity and efficacy). The questions that the authors aim to address are whether the co-folding models struggle with memorization, whether they can distinguish between a true and a false binder, whether they replicate experimental binding affinities and efficacy, and how they compare to the physics-based docking algorithm (DOCK3.7).

      Strengths:

      Overall, this is a scientifically solid paper. The work is highly detailed and well executed, featuring thorough data analysis and statistical assessment.

      Weaknesses:

      My main concern is that the study's aim is a bit unclear. Modern benchmarking studies comparing physics-based docking with deep learning-based co-folding approaches (e.g., AF3, Boltz-2, Chai-1, and others) are increasingly expected to go beyond aggregate performance metrics. In addition to rigorous dataset construction, transparent methodology, and appropriate statistical evaluation, high-impact benchmarks typically provide actionable guidance on when each method class is most appropriate, reflecting their distinct inductive biases and practical constraints. Failure-mode analyses that link performance differences to protein flexibility, ligand chemistry, or binding-site characteristics are particularly valuable, as they move comparisons beyond "scoreboard" assessments toward mechanistic understanding. While full biological validation is not expected, qualitative interpretation grounded in physical and biological principles strengthens conclusions. Providing reproducible workflows or reference pipelines is not mandatory, but it is increasingly viewed as a best practice because it facilitates adoption and helps contextualize results for practitioners.

    3. Reviewer #2 (Public review):

      Summary:

      The manuscript by Kim et al. evaluates the performance of three modern AI-based methods in predicting complex structures and binding affinities between proteins and chemical compounds. An honest 'prospective' evaluation is achieved by studying benchmark structures and chemical compounds that did not exist in the PDB at the time the AI structure prediction models (AlphaFold3, Chai-1, Boltz-2) were trained.

      Strengths:

      (1) The study addresses an important question in modern computational biology and drug discovery, and establishes the strengths and limitations of the three tools in solving various computational chemistry tasks, including compound pose prediction, active-inactive discrimination, and potency ranking.

      (2) The conclusions are based on examination of four separate targets and respective compound datasets, where for one of the targets, the authors also obtained numerous X-ray structures to serve as experimental answers for the binding pose prediction task.

      (3) The study reports relationships between structure prediction confidence, predicted energies (DOCK3.7), and affinity predictions (Boltz-2) with the geometric accuracy of compound pose prediction as well as the experimentally measured potency.

      (4) One of the key findings is the limited ability of co-folding methods to predict conformational rearrangements, which does not correlate with their ability to predict binding poses of the compounds inducing these rearrangements.

      (5) The findings could serve as useful guidelines for computational chemists in selecting appropriate software and scoring schemes for each task.

      Weaknesses:

      While I consider this a solid study, several aspects would need to be addressed to make it really strong:

      (1) DOCK3.7 docking and scoring experiments were performed using one experimental structure of Mac1, selected from dozens of structures based on a criterion that is not sufficiently well justified. For the sigma2 receptor, dopamine D4 receptor, and AmpC β-lactamase, it is not clear which structures or models were selected for docking at all. It is well known that geometry predictions, scoring, and active-inactive ROC AUCs are all strongly influenced by the selected structure. It would be important to attempt Mac1 docking using all available experimental Mac1 structures, or at least against representative structures in various conformations; it would also be quite insightful to compare results to docking of the same compound sets to AF3, Boltz-2 and Chai-1 predicted structures of Mac1. The same goes for the docking studies of sigma2, D4, and AmpC β-lactamase.

      (2) For binding affinity predictions, as a control, the authors should consider compound co-folding with an unrelated protein, or even with a pseudo-peptide that consists of a few random single amino acids - this would provide an honest baseline for such predictions.

      (3) ROC curves in Figure 3 and elsewhere should be shown, and AUCs quantified/reported, on a log or square-root scaled x-axis, to emphasize early enrichment, which is the area of practical significance for these predictions. For example, Figure 3A currently suggests that the pose prediction performance of AF3 exceeds that of Boltz-2, whereas the early enrichment is clearly better for Boltz-2.

      (4) 'Trained set' in figures and text should probably be 'training set'? Or otherwise explain this new term the first time it is introduced.

      (5) Figure 1 illustrates a projection onto the first two principal components of a space that apparently had only one (scalar) metric for each compound pair (% maximum common substructure or Tanimoto coefficient); the authors need to better explain the principle behind this analysis and visualization.
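
      Point (3) above can be illustrated with a small sketch that compares a linear AUC with an AUC computed over a log10-scaled false-positive-rate axis. The function names, the `fpr_min` lower bound, and the two toy rankings are hypothetical; the sketch assumes untied scores and at least one active and one inactive compound, and it is not the scoring procedure used in the manuscript.

```python
import math


def roc_points(scores, labels):
    """ROC points (FPR, TPR), ranking higher scores first; labels are 1 (active) or 0."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in ranked:
        if label:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def linear_auc(scores, labels):
    """Trapezoidal area under the ROC curve on a linear x-axis."""
    pts = roc_points(scores, labels)
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(pts, pts[1:]))


def log_auc(scores, labels, fpr_min=0.001):
    """Area under the ROC step curve over log10(FPR) in [fpr_min, 1], scaled to [0, 1]."""
    pts = roc_points(scores, labels)
    area = 0.0
    for (x0, y0), (x1, _) in zip(pts, pts[1:]):
        if x1 > x0 and x1 > fpr_min:  # horizontal step extending past fpr_min
            lower = max(x0, fpr_min)
            area += y0 * (math.log10(x1) - math.log10(lower))
    return area / -math.log10(fpr_min)


scores = [6, 5, 4, 3, 2, 1]
early = [1, 0, 0, 0, 0, 1]  # one active ranked first, one ranked last
late = [0, 1, 1, 0, 0, 0]   # both actives found only after the first inactive
for name, labels in (("early", early), ("late", late)):
    print(name, linear_auc(scores, labels), log_auc(scores, labels, fpr_min=0.01))
```

      In this toy case the "late" ranking has the higher linear AUC while the "early" ranking has the higher log-scaled AUC, which is the kind of reversal the reviewer describes.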

    4. Reviewer #3 (Public review):

      Summary:

      This study's core conclusions are well-supported by data. It is shown that co-folding outperforms docking in known ligand pose/affinity prediction (validated by RMSD and IC₅₀ correlation), struggles with false-positive discrimination in virtual screens (lower AUC values), and is complementary to docking (non-correlated errors, distinct strengths in drug discovery stages).

      Strengths:

      (1) Unprecedented prospective design with 557 novel Mac1-ligand complexes ensures rigorous, independent evaluation of co-folding methods.

      (2) Comprehensive comparison of 3 co-folding tools (AlphaFold3, Chai-1, Boltz-2) with DOCK3.7 across diverse targets and metrics enables nuanced performance assessment.

      (3) The study clearly demonstrates complementary roles of co-folding (superior pose/affinity prediction for known ligands) and docking (better hit prioritization), and addresses deep learning memorization concerns via ligand similarity analysis.
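
      The ligand similarity analysis mentioned in (3) rests on pairwise similarity measures such as the Tanimoto coefficient. A minimal sketch, assuming each ligand is represented as a set of "on" fingerprint bit indices; the function names and bit sets are hypothetical, and a real analysis would typically use a cheminformatics toolkit to generate the fingerprints.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) coefficient between two sets of 'on' fingerprint bits."""
    a, b = set(fp_a), set(fp_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(union)


def max_similarity_to_training(query_fp, training_fps):
    """Highest Tanimoto similarity between a test ligand and any training ligand."""
    return max(tanimoto(query_fp, fp) for fp in training_fps)


# Hypothetical fingerprints.
training = [{1, 2, 3}, {7, 8, 9, 10}]
query = {2, 3, 4}
print(max_similarity_to_training(query, training))
```

      A low maximum similarity to the training ligands is one simple indication that a correct prediction is not explained by memorization alone.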

      Weaknesses:

      (1) Limited generalization to diverse protein families (e.g., no ion channels/transporters).

      (2) Ambiguity in the mechanism underlying co-folding's failure to predict rare conformational changes.

      (3) Virtual screen comparison is unbalanced (docking-prioritized hit lists bias results).

    1. eLife Assessment

      In this study, the authors describe the degradation of HDACs in late HSV-1 infection and attempt to link this phenomenon to HDAC export to the cytoplasm and to DNA damage response. However, the evidence is incomplete, as many of the experiments are lacking in rigor. As a result, mechanistic links to the proposed model are weak.

    2. Reviewer #1 (Public review):

      Summary:

      In this study, the authors propose that HSV-1 infection degrades the class I histone deacetylases HDAC1 and HDAC2. The MDM2 E3 ubiquitin ligase from the DNA damage response pathway is responsible for ubiquitinating these HDACs that are subsequently degraded via proteasomes. The authors hypothesize that HDAC degradation will cause hyperacetylation of viral chromatin and enable viral gene transcription.

      Strengths:

      The ubiquitination of HDAC1 & HDAC2 by Mdm2 and the mapping studies are clear.

      Weaknesses:

      (1) Degradation of HDACs is observed late, at least 12-24 h post-infection (1 PFU/cell). Viral genes have been transcribed by that point, and the virus has replicated its genome. The kinetics do not match the proposed model.

      (2) The authors need to connect these findings with their story. As of now, these findings are correlative. For example, what is the impact of MDM2 depletion on viral gene expression and progeny virus production? Leptomycin B is not specific to the HDAC cytoplasmic translocation, and its effect on the infection could be due to its effect on ICP27.

      (3) The time point when the inhibitors were added to the cultures has not been stated in any experiment. If inhibitors were added with the virus, viral gene expression would be blocked.

      (4) The authors need to present late gene expression data in all the experiments where drugs have been used.

      (5) Figure 1A, ICP4 is not detected up to 12 hours post-infection of HeLa cells with 1 PFU/cell. This cannot be true.

      (6) Leptomycin B blocks nuclear/cytoplasmic shuttling of ICP27 that brings viral mRNAs to the cytoplasm to be translated. So, the effect of LMB is not specific to the HDACs.

      (7) The key experiment is to use the degradation-resistant form of HDAC1 to evaluate its impact on viral gene transcription.

      (8) In the experiment where Mdm2 was depleted, the authors need to demonstrate the effect on the infection. ICP4 expression is not enough. How about growth curves? After Mdm2 depletion, ICP4 expression increases, which may contradict the authors' findings. An analysis of alpha and gamma gene expression is important.

      (9) Why did the authors analyze a liver HSV-1 infection and not a more relevant skin infection?

    3. Reviewer #2 (Public review):

      Summary:

      The authors discovered that HDAC1/2 are degraded in HSV-1 and PRV infections. They attempted to establish a new mechanism by which HDAC1/2 are translocated to the cytoplasm to be degraded in HSV-1 infection, and the degradation causes changes in histone acetylation to affect the DDR pathway.

      Strengths:

      (1) Interesting findings of HDAC1/2 degradation during HSV-1 and PRV infection, and it may impact more than the virology field.

      (2) Significant work to identify the ubiquitin site in HDAC1/2 and K63 linkage.

      Weaknesses:

      (1) Insufficient evidence to support the mechanism described by the authors.

      (2) Expansion of the conclusion to alphaherpesvirus without studying the intended mechanism in PRV infection.

      Overall, there may be a correlation between HDAC1/2 level, ATM/ATR phosphorylation, and HDAC1 translocation during the HSV-1 infection. However, core evidence supporting the mechanism that a) HDAC1 export causes its degradation, b) degradation of HDAC1 causes histone acetylation changes and DDR activation has not been sufficiently demonstrated.

    4. Reviewer #3 (Public review):

      The authors state that infection of cells by the alphaherpesviruses HSV-1 or PRV leads to a proteasome-dependent reduction in levels of HDAC1 and HDAC2 and that this leads to chromatin hyperacetylation, a DNA damage response, and greater replication of these viruses. Previously, other authors reported no change in levels of HDAC1 and HDAC2 after HSV-1 infection of human cells, but this paper is neither cited nor commented on in this new submission. The experiments are poorly designed. For instance, most of the time points analysed are well beyond the time needed for HSV-1 replication and are therefore not biologically relevant. The infections are done with a dose of virus that does not ensure that all cells are infected synchronously, but rather infection spreads from cell to cell with multiple rounds of replication. Some essential controls are missing. Additionally, this reviewer feels that the data presented do not support the conclusions drawn. Currently, links are not established between a reduction in HDAC1/2 and other phenomena such as hyperacetylation of histones, a DDR, and altered virus replication. The paper does not identify which HSV or PRV protein(s) induce reduction in HDACs, nor how the HDACs mediate antiviral activity; what are the HSV-1 or PRV protein targets? Lastly, the paper is not well prepared, and it does not adequately refer to prior literature.

    1. eLife Assessment

      This useful study examines patterns of clonal reproduction and somatic mutations in 'Pando', a massive, quaking aspen clone consisting of ~47000 stems. Because the study relies on relatively low-coverage, reduced-representation genomic resequencing data for the detection of somatic mutations, the evidence provided for several of the primary conclusions about clone age and the relationship between mutation accumulation and geographic distance is incomplete.

    2. Reviewer #1 (Public review):

      Summary

      The authors use reduced-representation sequencing (GBS) across samples from the quaking aspen clonal stand Pando to identify putative somatic mutations, which were used to estimate clone age, and evaluate whether somatic variation shows spatial structure across the grove. This is a compelling and charismatic system to look at somatic mutation in plants. They report little sharing of putative somatic mutations as a function of distance and interpret this as evidence for weak mutation transmission or homogenization over time, potentially driven by rapid root growth and clonal spread dynamics. They use mutations to estimate clone age. The authors are generally upfront and commendably transparent about limitations in sequencing depth and mutation calling. The paper addresses an interesting research system, but struggles to overcome limitations in the suitability of the data.

      Strengths.

      This is a fantastic system and an interesting set of questions. The authors' GBS data does a great job distinguishing Pando from its neighbors, which is an important first step in studying the history of this clone.

      The manuscript is upfront and highlights the need for improved data to refine inference, for example: "Higher-coverage whole-genome sequencing, and ideally single-cell sequencing of defined meristem lineages, will be needed to refine mutational and evolutionary parameter estimates in this iconic organism."

      It also states that "either we are missing roughly 80% of true somatic mutations or only 20% of the mutations we detect are true positives."

      I appreciate that the authors report an age estimate range that considers the breadth of potential false negatives and positives.

      Weaknesses

      I am still not sure whether the paper overcomes issues with the use of GBS for somatic mutation calling.

      I found it difficult to reconcile the manuscript's description of the call set as "conservative" with the reported validation tests (calibrated by looking at retained variants detected in 2 of 8 technical replicates). How was this threshold determined? A mutation with 2/8 has quite low reproducibility, which could reflect either substantial false negatives under low depth (true variants frequently dropping out) or false positives that recur sporadically due to library- or sequencing-specific artifacts. Without stronger internal diagnostics or external validation, it is hard to determine which applies here.

      The GBS sequence space and genomic distribution could be more clearly explained. According to the methods, "The total number of base pairs sequenced (129,194,577) was estimated using angsd, and reduced following the proportion of base pairs that we filtered out because of low coverage (48%)." What does the 129M basepairs represent? Is that 129M/genome length, or is it the number of aligned basepairs (i.e., 1M genome covered x129 depth)? In addition, it would help to summarize where GBS loci fall across the genome (genic vs intergenic vs TE; repetitive vs unique), since these can have substantially different somatic mutation rates (Meyer et al. 2025). Without additional summary/descriptive statistics, it is hard to interpret both missingness and "rate".

      Statistical concerns about some results. In the Figure 3 legend, the authors state that the sample-level relationship between shared variants and distance is significant: "Pearson correlation coefficient ... is −0.02, 95% CI = [−0.05, 0.00], which is significantly different from a randomized distribution (P < 0.001) (B)." However, as plotted in Figure 3B, the observed correlation (−0.02) appears to fall well within the bulk of the randomized distribution of correlation coefficients. If the reported P value is intended to be permutation-based (i.e., the tail probability under the randomized null), it is unclear how P could be < 0.001 given that the observed value does not appear extreme relative to the null.
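
      The logic of this concern can be made concrete with a minimal permutation-test sketch. It assumes a simple shuffle of one variable and a two-sided test; the function names are hypothetical, and the manuscript's actual randomization scheme may differ (for pairwise distance data, permuting sample labels as in a Mantel test would be the more appropriate null). The point is that the P value is the fraction of randomized correlations at least as extreme as the observed one, so an observed value sitting in the bulk of the null distribution cannot yield a small P.

```python
import random

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def permutation_p_value(x, y, n_perm=1000, seed=0):
    """Two-sided permutation P value for the correlation between x and y."""
    rng = random.Random(seed)
    observed = pearson_r(x, y)
    shuffled = list(y)
    n_extreme = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(pearson_r(x, shuffled)) >= abs(observed):
            n_extreme += 1
    # Add-one correction: the smallest attainable P is 1 / (n_perm + 1).
    return observed, (n_extreme + 1) / (n_perm + 1)
```

      With this definition, P < 0.001 requires at least 1,000 permutations and essentially no randomized coefficient as extreme as the observed one, which is what the histogram in Figure 3B would need to show.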

      The developmental program of plant stem cell layers is essential, but not discussed much. In a root-spreading clone, expectations about mutation sharing depend strongly on how new ramets arise developmentally (root-derived meristem initiation) and how layered meristems partition mutations across tissues (e.g., L1/L2/L3). I was surprised there was not a substantial discussion of the details about the layer specificity of somatic development and mutation accumulation in plants. Especially relating to mutations that would be shared between roots/shoots around potential layer-specific growth of roots. The current analysis seems to focus on comparisons within tissue types (e.g., leaves between ramets), but did not report informative tests between tissue and within-ramet (e.g., in heavily sampled trees, whether a ramet's root, shoot, leaves, share a subset of variants; whether neighboring ramets share root-lineage variants more than shoot-lineage variants). It would help to articulate expectations and clarify what the data can and cannot test. Relatedly, for "mutation rates," in aging material, it would be good to discuss which meristem layer(s) each tissue is likely sampling and how layer-specific mutation dynamics (e.g., reported differences between L1 vs L2 lineages) could influence rate and therefore age estimates (Goel et al. 2024, Amundson et al. 2025).

      Developmental mosaicism makes expected allele fractions lower than discussed in the paper. The supplement states, "However, because the Pando clone is triploid, it reduces our expectation for fixation of a mutation to 0.33", but this ignores layer-specific stem cells in plant development. True that if calls are made against a haploid reference, then a new somatic mutation in a triploid background is expected around ~1/3 allele fraction - but only if fixed in 100% of cells. Layer-specificity (e.g., L1 vs L2 vs L3 restriction) or polyclonal founding events will push expected allele fractions substantially lower. Therefore, at ~12-14× depth (or min of 4x), these allele fractions translate into only a handful (or even 0) of alternate reads (<<33% is expectation).
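
      The read-count arithmetic behind this point can be sketched with a binomial model. The sketch assumes reads are sampled independently; the calling threshold `min_alt_reads` and the layer-restricted allele fraction of 1/9 (one of three chromosome copies, in one third of sampled cells) are hypothetical illustrations, not values from the manuscript.

```python
from math import comb

def prob_at_least(min_alt_reads, depth, allele_fraction):
    """P(X >= min_alt_reads) for X ~ Binomial(depth, allele_fraction)."""
    return sum(
        comb(depth, k) * allele_fraction ** k * (1 - allele_fraction) ** (depth - k)
        for k in range(min_alt_reads, depth + 1)
    )

# Fixed triploid mutation (1/3) versus a hypothetical layer-restricted one (1/9).
for allele_fraction in (1 / 3, 1 / 9):
    for depth in (4, 12):
        expected_alt = depth * allele_fraction
        p_two_or_more = prob_at_least(2, depth, allele_fraction)
        print(f"AF={allele_fraction:.3f} depth={depth:>2} "
              f"expected alt reads={expected_alt:.2f} P(>=2 alt)={p_two_or_more:.3f}")
```

      Under these assumptions the expected number of alternate reads at 12x falls from 4 for a fully fixed mutation to about 1.3 for a layer-restricted one, which is the regime where dropout and sequencing error become hard to separate.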

      Within-tree replicate consistency was unclear. The manuscript hints at multiple samples/replicates per tree (e.g., Figure S2), but it is not clear how often the same putative somatic variants are recovered across samples from the same ramet and tissue. A simple reproducibility summary would be extremely helpful: for variants called in one sample, what fraction are recovered in other samples from the same tree (by tissue), what variant allele fractions, and how do their spectra compare to mutations unique to a single sample?

      The manuscript did not provide supplemental tables or mutation calls. Supplemental tables containing pre-filter and/or post-filter calls (or some other structured data file with flags indicating various quality metrics, REF vs ALT depths at minimum, REF call, and ALT call) would substantially improve transparency and ability to evaluate the work.

    3. Reviewer #2 (Public review):

      Summary:

      The topic of the paper is intriguing as it sets out to age one of the potentially largest living organisms, a tree clone (Pando), using shallow genome resequencing of a large number of replicate samples. The key result is that the Pando clone is several tens of thousands of years old, which is of high interest to plant genomics and evolutionary ecology.

      Weaknesses:

      Unfortunately, the claims are not matched by the available data and their analysis. Probably, the results also cannot be resurrected using modified analyses, as the available data are not suited to reliably detect somatic genetic variation as a means to age clonal plants.

      In order to reliably age clones, one needs to consider the full process by which clone mates genetically diverge from one another over time, which starts with a plant's shoot apical meristem (SAM). From this, all above-ground tissues such as twigs and branches, as well as leaves, are derived, which has been beautifully worked out now in oaks and many fruit trees (e.g., doi: 10.1101/2023.01.10.523380; 10.1101/2024.01.04.573414). For the accumulation and propagation of fixed somatic genetic variation, only the processes in the SAM matter. Hence, it makes little sense to look at tissue-specific mutations unless one is invoking non-cell division induced mutations through UV light. Those, however, would remain undetected with the present low-coverage sequencing as they cannot leave the mosaic status any more, as that tissue is essentially non-dividing.

      Somatic genetic drift (https://www.nature.com/articles/s41559-020-1196-4) is the foundation for the fixation of somatic genetic variation and hence, for ageing (plant) clones. It requires quantitative modeling of the processes at the cell-line level when new modules (here, aspen trees) are formed, in particular N (cell population size) and N0 (founder cell size).

      Calibrations have to be made using the mutation and fixation rate at the somatic cell lineage level, ideally also with some empirical data. In trees such as aspen, it would be very easy to obtain calibration points of branch tips that have physically and thus genetically diverged upon a defined TCA to directly determine the rate of accumulation of somatic genetic variation by direct dendrochronology (i.e., counting tree rings).

      Instead, in the present work, a mutation rate from another tree species is taken, which will introduce a lot of uncertainty into the estimates, given that tree SAMs divide at a very different pace (see doi 10.1093/evolut/qpae150). It is clear that a small difference in the assumed mutation rate, e.g., a higher one, would conversely reduce the age estimate considerably.
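
      The sensitivity described here follows directly from strict-clock arithmetic, sketched below. This is a deliberately simplified illustration, not the coalescent model used in the manuscript: it assumes age equals per-site divergence divided by the per-site, per-year mutation rate, and the numeric inputs are hypothetical placeholders chosen only to show the scaling.

```python
def strict_clock_age(divergence_per_site, rate_per_site_per_year):
    """Age in years under a strict clock: divergence divided by rate."""
    if rate_per_site_per_year <= 0:
        raise ValueError("mutation rate must be positive")
    return divergence_per_site / rate_per_site_per_year

# Hypothetical placeholder values, not taken from the manuscript.
divergence = 1e-6   # accumulated somatic substitutions per site
base_rate = 1e-10   # substitutions per site per year

for factor in (0.5, 1.0, 2.0):
    age = strict_clock_age(divergence, base_rate * factor)
    print(f"rate x{factor}: estimated age = {age:,.0f} years")
```

      Because the estimate is inversely proportional to the assumed rate, any uncertainty in a rate borrowed from another species propagates one-for-one into the age estimate.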

      I am doubtful that a conventional phylogenetic model based on coalescence, such as the one employed here, can be utilized, as it assumes a sexually recombining population and hence variable sites. A model simulation on an asexually evolving population would be needed to check this.

      In order to reliably call somatic genetic variation, a decent coverage of short-read sequences is needed, definitely > 15x, which was achieved in the present dataset. This is particularly relevant as a fixation in one of the three haploid chromosome sets would just amount to a read frequency of only 0.33. A coverage of only 4x reads per called site seems very low to me; in other words, the filtering steps do not seem to be very rigorous to me. It is also difficult to follow the logic of several ad hoc adjustments that were made to compensate for the low coverage of sequencing, in particular, the common panel and the replicate identical samples. Why choose 80% in the latter?

      There are alternative, non-sequencing-based ways to double-check the accuracy of somatic SNP calls (e.g., described here https://www.nature.com/articles/s41559-020-1196-4), which could have been employed at least once to evaluate the error rates for the specific sequencing strategy.

      I also suggest that for any future study, mutation callers developed for cancer somatic mutation detection should be employed, which are now increasingly used both in clonal plants and trees for that purpose.

      What worries me is that there is a poor correlation between physical and genetic distance. This lack of correlation among spatial and genetic structure, for example, the star-like phylogeny presented in Figure 6d, indicates a large fraction of false positives rather than some special, as yet unexplained processes of local mutation accumulation that the authors claim to have discovered.

      Finally, the work is not properly embedded into the current literature. For example, recent developments of molecular clocks were not considered, such as the development of a dedicated somatic genetic clock that precisely addresses this question (https://www.nature.com/articles/s41559-024-02439-z). Also, older but nevertheless significant work that aged aspen clones using microsatellite markers is not mentioned (http://dx.doi.org/10.1111/j.1365-294X.2008.03962.x).

    1. eLife Assessment

      This important study explores whether complex structures that are lost during evolution can re-evolve, which is a long-standing debate in evolutionary and developmental biology. The authors demonstrate that re-evolution can occur if the gene regulatory network that underlies the development of complex traits is maintained. The evidence supporting its conclusions is solid and the work will be of interest to those studying the evolution and development of complex traits.

    2. Reviewer #1 (Public review):

      Summary:

      The manuscript by Vasquez-Correa and colleagues describes the expression pattern of the ocelli (simple eye) gene regulatory network in ants. They correlate the expression pattern of these genes with the presence and absence of ocelli in different classes and species of ants. The presence of ocelli is a polyphenic trait in ants - understanding the molecular and developmental underpinnings of polyphenic traits is of significant interest to evolutionary biologists, developmental biologists, and ecologists. The authors propose that the presence of the latent expression of the ocellar network in classes of ants that do not display ocelli in the adults may underlie the re-evolution of ocelli within the ant lineage.

      Strengths:

      The strengths of the manuscript are that it is well written, the images are of the highest quality, and the data support the conclusions of the authors.

      Weaknesses:

      One improvement that could be made is to include imaginal discs of the queen ants as well as scanning electron images of the ocelli of the queen ant to match the pupal stage images of the worker and soldier ants. A second improvement is to attempt a gene knockdown using RNAi or similar methods to ensure that the genes that are being studied are, in fact, responsible for ocelli development in the ant.

    3. Reviewer #2 (Public review):

      Summary:

      The manuscript titled "Latent gene network expression underlies partial re-evolution of a polyphenic trait in the worker caste of ants" by Vasquez-Correa et al. aimed to study genetic mechanisms underlying developmental plasticity, especially binary polyphenism in queen vs worker ant castes. This is an interesting question regarding the extent to which phenotypic traits were altered, lost or regained, and how molecular pathways (upstream vs. downstream) can facilitate this process.

      In ants, reproductive castes (queens and males) develop wings as well as 3 ocelli for mating flights and other activities, while worker castes are wingless, and in some species, they have either no or a reduced number of ocelli. The phylogenetic analysis showed that in the Camponotini ant clade, the one-ocellus phenotype re-evolved in three species independently. The authors analyzed the conserved developmental pathways between Drosophila (well-established) and ants using HCR (a high-quality in situ hybridization technique). They found that although upstream genes for the development of ocelli (otd and hh) showed similar expression between castes, downstream genes (toy, eya, and so) had reduced or no expression in workers of C. floridanus, and this differential expression may lead to partial or complete loss of ocelli. Consistently, workers develop rudimentary tissues, suggesting that they initiate the ocellus developmental process but somehow stop it before adulthood.

      Strengths:

      Evo-devo approaches to reveal conserved molecular pathways of ocellus development. High-quality HCR provided convincing evidence of the expression of key genes in ocelli, eyes and antenna throughout larval development.

      Using HCR, the authors showed differential expression of downstream genes in males vs. soldiers vs. minor workers of C. floridanus, which might explain phenotypic differences between castes.

      Weaknesses:

      Although the molecular pathway is conserved, the mechanism underlying the lack of ocelli in workers remains unclear. In C. floridanus, it could be explained by the evidence of no expression of certain developmental genes, but in other species, e.g. Polyrhachis rastellata, is their expression intact, or reduced? There is no control male.

      In addition, HCR in species with partial re-evolution (if their genomes have been sequenced) would be useful to understand the mechanism. For example, there might be differential spatial expression between medial and lateral ocelli.

    4. Reviewer #3 (Public review):

      Summary:

      This paper examines the loss and re-evolution of specific organs during the evolution of ants. The authors show that these organs, the ocelli, disappear and are re-evolved in different ant species and in different ant castes within these species. The authors show that this is linked to a conserved GRN discovered in Drosophila, that appears to underlie the development of the ocelli, and demonstrate that this GRN appears to remain active in the developing heads of ants that have no ocelli, implying that it is the evolutionary latency of this GRN that allows loss and subsequent evolution.

      Strengths:

      This manuscript has outstanding imaging of a very difficult developing organ, and the key data, fluorescence in situ hybridisation, is done well and clearly shows what the authors wish to demonstrate. The methods are well described and underpin the whole work.

      The authors convincingly demonstrate that gene expression patterns imply the conservation of the ocellus gene regulatory network from Drosophila to ants. They further show that this network is present even in ants that don't produce an adult ocellus, but do show that in those species, loss of a developing nascent ocellus (which they identify) occurs at the same time as an interruption in the expression of the key genes in the GRN. All of this data is beautifully presented and explained.

      Weaknesses:

      There is one key weakness in that there are no functional studies that indicate that the GRN actually does make the ocellus, though the expression patterns are convincing. This applies to loss of the ocellus as well. It would be nice to see that transient loss of the ocelli GRN might lead to loss of ocelli in ant species that have them. These are very difficult things to achieve, as the key genes have earlier developmental roles, such that CRISPR knockouts would not be interpretable, and transient RNAi in the head capsules of developing pupal ants would be challenging.

    1. eLife Assessment

      This important study provides new insight into the regulation of cell organization and division in Trypanosoma brucei through the control of a kinesin motor protein by a polo-like kinase. The authors present solid evidence from rigorous biochemical and imaging analyses showing that phosphorylation modulates kinesin function and cellular organization. However, direct in vivo evidence that PLK phosphorylates kinesin-G is lacking.

    2. Reviewer #1 (Public review):

      Summary:

      This manuscript identifies the orphan kinesin KIN-G as a substrate of Polo-like kinase (TbPLK) in Trypanosoma brucei and demonstrates that phosphorylation of Thr301 inhibits KIN-G microtubule binding and disrupts its cellular function. Using a combination of in vitro kinase assays, phosphosite mapping, microtubule binding and gliding assays, and in vivo complementation with phosphomimetic and phosphodeficient mutants, the authors link TbPLK-mediated regulation of KIN-G to defects in centrin arm integrity, FAZ elongation, Golgi organization, flagellum positioning, and division plane placement. The study provides a mechanistic advance in understanding how TbPLK regulates centrin arm biogenesis and integrates KIN-G into the growing regulatory network controlling hook complex and FAZ assembly. Overall, the work is technically strong, internally consistent, and builds logically on previous studies from this group and others.

      Strengths:

      A major strength of the manuscript is the clear mechanistic link between phosphorylation of Thr301 and loss of microtubule binding activity. The use of phosphomimetic (T301D) and phosphodeficient (T301A) mutants in an RNAi-rescue framework provides a clean and convincing demonstration of functional relevance in vivo. The integration of biochemical assays with detailed cell biological phenotyping (centrin arm length, FAZ elongation, basal body segregation, and cytokinesis markers) is particularly effective and makes the central conclusion robust. The observed phenotypic cascade from centrin arm defects to FAZ and division plane abnormalities is also well aligned with existing models of trypanosome morphogenesis.

      Weaknesses:

      My (more or less main) concern relates to the interpretation of the Golgi phenotype. The conclusion that phosphorylation of KIN-G "impairs Golgi biogenesis" is currently based on fluorescence microscopy using TbGRASP and Sec13 markers and on quantification of the number and distribution of Golgi/ERES puncta in binucleated cells. While these data convincingly demonstrate altered Golgi/ERES number and spatial organization, they do not distinguish between true defects in Golgi biogenesis or duplication and alternative possibilities such as fragmentation, vesiculation, or mislocalization of Golgi membranes. Given the central role of Golgi-centrin arm organization in the proposed model, ultrastructural analysis (for example, by EM or electron tomography) would greatly strengthen this aspect of the study by providing direct evidence for structural alterations of the Golgi and its association with the centrin arm and ERES. Such data would elevate this part of the manuscript from a descriptive fluorescence phenotype to a true structural cell biological insight. I appreciate that this experiment goes beyond the current dataset, but it would substantially enhance the mechanistic depth of the Golgi-related conclusions and strengthen the causal chain linking centrin arm defects to Golgi abnormalities. However, I have to confess, the inclusion of such data would make this reviewer particularly enthusiastic about the work. If this is not feasible, I would recommend tempering the wording of "Golgi biogenesis" to a more conservative description, such as altered Golgi organization or duplication, and explicitly acknowledging the limitations of fluorescence-based analysis for this conclusion.

      An additional conceptual point concerns the dual role of TbPLK in centrin arm regulation. TbPLK is known to promote centrin arm biogenesis through phosphorylation of TbCentrin2, yet in this study, TbPLK phosphorylation of KIN-G negatively regulates centrin arm assembly. This dual positive and negative regulatory role is intriguing but could be discussed more explicitly. The manuscript would benefit from a clearer conceptual framework addressing how phosphorylation of KIN-G might serve as a temporal or spatial switch to restrain KIN-G activity at specific stages of centrin arm assembly.

      Finally, a schematic model summarizing the proposed regulatory pathway from TbPLK phosphorylation of KIN-G to centrin arm assembly, FAZ elongation, division plane placement, and Golgi organization would aid the reader.

    3. Reviewer #2 (Public review):

      Summary:

      The authors identify KIN-G as an in vitro substrate for phosphorylation by TbPLK and show that several of the in vitro P-ated sites, including T301, overlap with P-ation sites seen in live cells. The authors further show that PLK-mediated P-ation inhibits KIN-G binding to microtubules in vitro, as does a KIN-G-T301D mutant, and that expression of a KIN-G-T301D phospho-mimic in T. brucei phenocopies KIN-G RNAi knockdowns, producing defects in cell division, morphogenesis of the centrin arm, FAZ and other cellular structures, as well as a misplaced cytokinesis furrow.

      Understanding cytoskeletal rearrangements that drive cell division in T. brucei is an important and unresolved problem, so the work addresses important questions that are of great interest. PLK and KIN-G have previously been shown to be important for cell division and morphogenesis of cytoskeletal structures that drive cell division in T. brucei. The current work advances our understanding by suggesting a potential mechanism by which PLK and KIN-G might participate, namely through PLK-dependent P-ation to control KIN-G MT binding activity.

      Strengths:

      The authors use a rigorous combination of biochemistry, phosphoproteomics, cell biology, and mutant analysis to support their conclusion that PLK-mediated P-ation of KIN-G negatively regulates KIN-G microtubule binding, and this may explain the observation that a KIN-G T301 phosphomimic mutant blocks cell division and perturbs biogenesis of cytoskeletal structures that drive cell division and morphogenesis. Combining rigorous and informative in vitro studies with mutant analysis in live cells is a great strength. The work is solid and important, though a few pieces are needed to fully connect the in vitro findings with the in vivo observations, as detailed below.

      Weaknesses:

      Overall, I find this work to be solid and to provide an important addition to our understanding of mechanisms controlling cell division in T. brucei. The biochemistry, in particular, is rigorous and convincingly demonstrates PLK can P-ate KIN-G, altering its MT-binding ability. Analysis of phospho-mutants of KIN-G in live T. brucei supports the conclusion that P-ation of KIN-G at T301 negatively affects KIN-G function in vivo. I think, however, that the results fall short of supporting the title, because, although the data convincingly show that PLK can phosphorylate KIN-G at T301 in vitro, and that T301 is P-ated in vivo, they do not formally demonstrate (nor even test) whether PLK is the kinase responsible for this phosphorylation in vivo (experiments to address this seem quite feasible). I also do not see where the authors try to reconcile the absence of phenotype for KIN-G-T301A with the implied importance of KIN-G phosphorylation by PLK in cell division, which calls into question the need for P-ation of KIN-G-T301 in cell division. Suggestions for addressing these concerns are provided below.

      My two main questions are:

      (1) What is the biological relevance of KIN-G P-ation at T301?

      a) The authors report no defect for the KIN-G-T301A mutant, so what then is the need for T301 P-ation, if the cell gets along fine without it? One step toward addressing this would be to ask what fraction of KIN-G shows P-ation at T301. Although published studies indicate P-ation at T301, it isn't known what percentage of KIN-G in the cell is P-ated. One might anticipate, for example, that T301-P is a small minority of the population in asynchronous cultures and that T301 P-ation increases at specific cell cycle stages.

      b) Published work links PLK to cell division, FAZ elongation, etc. The current work suggests that one role of PLK is to P-ate KIN-G at T301. In contrast, however, the current work also indicates that P-ation of KIN-G at T301 is unnecessary for normal cell division, FAZ elongation, etc.

      c) Some experiments or at least commentary on points a and b above would strengthen the paper.

      (2) Is PLK the kinase that P-ates KIN-G T301 in vivo?

      a) The authors show PLK P-ates T301 (and other residues) in vitro, and that T301 is P-ated in vivo. To bring the analysis full circle, it would be informative to examine KIN-G P-ation in a PLK mutant or upon inhibition of PLK with published inhibitors. This seems to be a very doable experiment with the tools available.

    4. Reviewer #3 (Public review):

      Summary:

      Here, the authors investigate the role of the Trypanosoma brucei polo-like kinase TbPLK in the function of flagellum-associated cellular structures in trypanosomes. They set out to test the hypothesis that a key substrate of TbPLK is the kinesin protein KIN-G, and that TbPLK phosphorylation of KIN-G regulates its functions in cells.

      Strengths:

      Using in vitro biochemistry with purified proteins, the authors convincingly demonstrate that TbPLK phosphorylates KIN-G at 29 sites. Moreover, they convincingly show that phosphorylation at one site, T301, impairs the binding of purified KIN-G to purified microtubules. Using immunofluorescence-based imaging approaches, they also show that TbPLK colocalizes with KIN-G at centrin arms during the early S-phase of the cell cycle. Centrin arms are structures that are located near the basal body and flagellum and are important for new flagellum biogenesis, Golgi positioning, and cell division. To evaluate the function of KIN-G phosphorylation in cells, they depleted KIN-G by RNAi, simultaneously expressed phospho-mimetic (T301D) and phospho-ablative mutant proteins, and used immunofluorescence to examine the impact on flagellum-associated cellular structures. They show that expression of the phospho-mimetic mutant KIN-G-T301D causes the following defects: reduced cell proliferation, disruption of centrin arm and Golgi biogenesis, impairment of FAZ elongation and flagellum positioning, and misplacement of the cell division plane. The data convincingly support the conclusion that KIN-G phosphorylation on T301 plays an important role in regulating the cellular functions of this kinesin motor protein.

      Weaknesses:

      Some of the broader conclusions are not directly supported by the data. For example, the title states "Polo-like kinase phosphorylation of the orphan kinesin KIN-G negatively regulates centrin arm biogenesis in Trypanosoma brucei," but the data do not directly address the specific role of TbPLK in phosphorylating KIN-G in cells. Moreover, some of the more specific conclusions in the paper, for example, that "phosphorylation of KIN-G" causes various cellular defects, are a bit of an overstatement. The supporting data rely on the expression of a phospho-mimetic mutant of KIN-G. Presumably, phosphorylation in cells is a normal part of KIN-G regulation, and it is not just phosphorylation, but rather hyperphosphorylation that is being mimicked by the mutant. Some rewording of the specific conclusions is warranted, and the broader conclusion would be better supported with additional experimental evidence.

    1. eLife Assessment

      This valuable study uses a large cohort of clinical malaria cases collected over 18 years to address a critical knowledge gap regarding the role of PfEMP1 variants across distinct severe malaria syndromes. The conclusions are potentially of importance and interest to those who study malaria severity, but the evidence is incomplete, largely due to a lack of clarity on data inclusion and the correct use of statistical tests. More up-to-date data analysis methods would further strengthen the conclusions.

    2. Reviewer #1 (Public review):

      Summary:

      Severe childhood malaria is associated with three main overlapping syndromes: impaired consciousness (IC), respiratory distress (RD), and severe malaria anaemia (SMA). One central feature of severe malaria, driven by host and parasite factors, is the sequestration of parasitized red blood cells in vascular beds, leading to impaired tissue perfusion and lactic acidosis. The causative agent, the parasite ligand PfEMP1, is expressed on the surface of infected red blood cells, where it binds to a broad range of different endothelial receptors. Accumulation of parasite-infected erythrocytes in the host's microvasculature has been repeatedly confirmed for cerebral malaria, but there are scarce data on the extent of sequestration in the other severe malaria syndromes. However, the absence of effective adjunctive therapies for severe malaria implies that our understanding of its pathogenesis remains incomplete. Thus, by comparing var gene expression from a large Kenyan cohort (n=372 severe cases; n=340 non-severe cases), this study addresses a critical knowledge gap regarding the role of PfEMP1 across distinct severe malaria syndromes. The substantial sample size, phenotypic stratification, and use of two complementary methods (DBLα-tag sequencing and RT-qPCR), along with data about the parasite's ability to form rosettes and antibody level assessments, provide a strong setup. Var gene expression data (either proportions of different DBLα-tags classified by the number of cysteine residues and presence of particular motifs, or relative expression RT-qPCR data from a set of primer pairs targeting conserved regions of var groups or particular domains) are tested for association with (a) severe malaria syndromes, (b) variant expression homogeneity, (c) rosetting ability, and (d) mortality using independent linear regression models, Spearman rank correlations, or logistic regression models.

      In summary, the study confirms that A-type and DC8-containing gene expression correlate with IC, that RD is associated with rosetting, and that SMA is linked to a high variant expression homogeneity (VEH) of var-A expression, which may indicate a longer infection duration. However, some findings remain inconclusive. For example, when analyzing pure syndromes, several associations changed: DC8 expression was also found to be significantly enriched in SMA (with multiple primer pairs) and RD, not exclusively with IC. Additionally, rosetting was associated with DC8 expression but not with IC, even though IC itself is linked to DC8 expression. Overall, the findings are significant and supported by a large dataset, though the reported evidence remains largely associative rather than mechanistic.

      Strengths:

      As the authors stated themselves, one of the key unresolved questions is whether severity-causing parasites are biologically different from parasites responsible for asymptomatic infections. This study is among the first to address this question using data from a large, phenotypically stratified cohort. The use of two complementary methods (DBLα-tag sequencing and RT-qPCR), together with data on the parasites' ability to form rosettes and assessments of antibody levels, provides an excellent experimental framework.

      Weaknesses:

      Even when assessing var gene expression using two different approaches - DBLα-tag sequencing and RT-qPCR targeting pre-defined variants - only a glimpse of the parasites' actual biology is captured. Moreover, a well-known confounder in gene expression studies of P. falciparum field isolates is variation in parasite age (hours post-invasion) or synchronicity, both of which significantly influence var gene expression. The methods employed in this study, unfortunately, do not allow for controlling or correcting for these factors. In addition, the older classification system for DBLα-tag data developed by Bull et al. is certainly still valid; however, more recent advances in bioinformatic tool development now allow for a more in-depth exploration of DBLα-tag datasets. Tools such as Varia (doi: 10.1186/s12859-022-04573-6), cUPS (https://doi.org/10.1371/journal.ppat.1012813), and upsML (doi: https://doi.org/10.1101/2025.05.19.654848) enable the prediction of DBLα-tag-connected PfEMP1 domains and the var group affiliations.

      As A-type var gene expression has already been associated with severity, most expression studies (including this one) have a selection bias towards A- and B/A-type var genes. Here, A- and B/A-types are covered by 8 primer pairs (gpA1, gpA2, 4x DC8, DC13, DC4), whereas highly polymorphic B-types are targeted by only 2 primer pairs (b1, DC9) and C-types by only a single primer pair (c2). Thus, any association with A-type expression is more likely to be observed, although evidence is accumulating that parasites preferentially express B-type var genes at the onset of blood stage infection in naïve/less immune individuals; this is also consistent with the observation of the authors that VEH is positively associated with immunity (measured as anti-IE) and negatively associated with temperature.

      I am not an expert in biostatistics, but to my understanding, independently performed regressions should be corrected for multiple testing.
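To illustrate the kind of multiple-testing correction the reviewer has in mind, the following is a minimal sketch of the Benjamini-Hochberg false discovery rate adjustment. It is not the authors' analysis: the p-values are hypothetical placeholders, and the choice of Benjamini-Hochberg (rather than, say, Bonferroni) is an assumption made for illustration only.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        candidate = p_values[idx] * m / rank
        running_min = min(running_min, candidate)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical p-values, e.g. one per independently fitted regression model.
p_values = [0.001, 0.008, 0.039, 0.041, 0.20]
q_values = benjamini_hochberg(p_values)
for p, q in zip(p_values, q_values):
    print(f"p = {p:.3f} -> adjusted = {q:.4f}")
```

In this toy example, the two p-values just below 0.05 no longer fall under that threshold after adjustment, which is the situation the reviewer's comment is concerned with.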

      Overall, the authors largely achieved their aims, identifying specific var groups associated with different severity syndromes. However, due to the complexity of var gene data and the interdependence of parameters, the resulting picture is not entirely clear. Some opposite results between different analyses may also be difficult for the reader to interpret. Nevertheless, this study can be considered a pioneering effort, providing valuable insights into the complex interplay of var gene expression across different severity syndromes and offering useful data for the field. Follow-up studies will be important to validate these findings and further dissect the mechanisms linking parasite gene expression to clinical outcomes.

    3. Reviewer #2 (Public review):

      Summary:

      The manuscript presents results of a study using two complementary approaches (RT-qPCR and DBL) to analyze the putative relationship between var gene transcription (and hence, PfEMP1 expression) and clinical presentation among Kenyan children with Plasmodium falciparum malaria. Binary rosetting (yes/no) data are used in a similar way. The study includes samples collected over a period of almost 20 years from about 700 children presenting with either severe (impaired consciousness [IC], respiratory distress [RD], severe anemia [SA]) or non-severe malaria. During the study period, the study area experienced a remarkable drop in P. falciparum transmission intensity.

      Strengths:

      The study stands on the shoulders of many similar studies of this kind, both by the authors and by other research teams, and the inferences made largely confirm those made previously. The current study has analytical rigor and a large sample size. Disentangling the multiple parameters of the above-mentioned relationship is of obvious and crucial importance to an improved understanding of P. falciparum malaria pathogenesis and of the targets and mechanisms of protective immunity to the disease. The present study is a valuable effort towards that. The study is well-structured, and the figures are clear.

      Weaknesses:

      It is somewhat unclear to this reviewer to what extent the samples and data analyzed and reported here are new (i.e., not used/analyzed in previous studies). If there is substantial overlap with earlier studies, this is a weakness because of the risk of circular inferences. The Discussion section would benefit from less repetition of the results section and a more in-depth discussion of the findings obtained relative to the existing literature. Better inclusion of key primary references is recommended.

    4. Reviewer #3 (Public review):

      Summary:

      In this manuscript, Ndugwa et al. attempt to link specific severe malaria manifestations with particular var gene expression patterns. This is an important question, and the dataset the authors have assembled over decades is impressive. However, greater clarity in the descriptions and statistics would, in my view, help this reviewer, and readers in general, develop a more precise understanding of the significance of the findings.

      Strengths:

      The study addresses a critically important question in malaria pathogenesis, and the dataset is extensive and represents a significant long-term effort by the authors.

      Weaknesses:

      The Results section often lacks clarity: clinical group definitions (NS, non-IC, non-SMA, mild vs. moderate) are sometimes ambiguous, and key methodological details, including the VEH index calculation, RT-qPCR quantification, antibody detection methods, and rosetting assays, are either missing from the results text or poorly explained in the figure legends. Additionally, figure presentation requires improvement, with inconsistent reporting of sample sizes, undefined colors, and p-values that overlap with data points rather than being clearly displayed above them.

    1. eLife Assessment

      This important study presents a novel immunotherapy strategy for cancer. The authors develop a whole-tumor cell vaccine comprised of senescent tumor cells and a COX2 inhibitor in a hydrogel matrix. They present convincing evidence of the efficacy of this approach in preclinical models, demonstrating that prostaglandin E2 (PGE2) modulates the senescence-associated secretory phenotype (SASP) toward an immunostimulatory state, although more mechanistic/functional work would strengthen their conclusions. This work is timely and will be of interest to immunologists and others interested in the development of novel cancer therapies.

    2. Reviewer #1 (Public review):

      Summary:

      The authors aimed to overcome the limitations of whole-tumor-cell vaccines, specifically the weak immunogenicity and rapid clearance often associated with them. They leveraged the unique properties of senescent tumor cells (STCs), which remain metabolically active and secrete chemokines, as a source of antigens. However, to counteract the secretion of the immunosuppressive lipid prostaglandin E2 (PGE2), which is part of the senescence-associated secretory phenotype (SASP), they engineered a hydrogel vaccine formulation (STCs+CLX-Lipo@Gel) containing STCs and liposomal celecoxib (a COX2 inhibitor).

      Strengths:

      (1) The study is conceptually strong in its approach to leveraging the SASP to improve immunotherapy responses. By selectively inhibiting COX2/PGE2 while preserving the secretion of recruitment chemokines (like CCL2 and CCL5) in the SASP, the authors successfully turn a potentially deleterious cellular state into a therapeutic asset.

      (2) Mechanistic Insight: The manuscript provides detailed evidence regarding the mechanism of action. The authors convincingly show that the vaccine restores activity in the NK-DC axis. Specifically, they demonstrate that reducing PGE2 levels enhances NK cell activation (upregulation of NKG2D and NKp46) and promotes the secretion of CCL5 and XCL1 by NK cells, which subsequently recruits cDC1 dendritic cells.

      (3) The therapeutic potential is tested across multiple models, including a subcutaneous melanoma model, a difficult-to-treat melanoma brain metastasis model, and an orthotopic pancreatic cancer model. The consistent efficacy across these distinct physiological contexts suggests broad applicability.

      Weaknesses:

      (1) While the authors successfully inhibit PGE2, the SASP is a complex cocktail of factors. The discussion regarding the long-term presence of these "live" senescent cells is somewhat limited. Although the hydrogel retains cells locally, the potential for other chronic inflammatory factors to eventually promote tumorigenesis or tissue damage in the surrounding niche warrants careful consideration when translating this approach to patients and may require additional preclinical testing.

      (2) The study posits that STCs serve as an antigen reservoir. However, the manuscript would benefit from a clearer distinction between whether the immune system is recognizing senescence-specific neoantigens or simply shared tumor antigens that are being presented more effectively due to the adjuvant effect. The authors briefly touch upon neoantigens in the discussion, but the experimental data primarily measure general anti-tumor responses.

      Impact:

      This work bridges material science and immunology, offering a practical solution to the immunosuppressive barriers of cell-based vaccines. It provides a platform that could potentially be adapted for various solid tumors.

    3. Reviewer #2 (Public review):

      Summary:

      Wang et al. examined an engineered whole-tumor-cell vaccine based on senescent tumor cells co-encapsulated with liposomal celecoxib in a chitosan hydrogel. The authors propose that prolonged persistence of senescent cells, combined with COX2/PGE2 inhibition, restores NK-DC crosstalk, enhances cDC1 recruitment, and ultimately drives robust CD8⁺ T-cell-mediated antitumor immunity. The study is nicely executed and clearly presented, with extensive in vitro and in vivo validation across multiple tumor models, including melanoma brain metastases and orthotopic PDAC. While the overall concept is timely and of potential interest, several mechanistic conclusions are based primarily on correlative evidence and would benefit from additional functional experiments to strengthen causal interpretation and translational relevance.

      Strengths:

      (1) Strong conceptual framework

      (2) Impressive breadth of in vivo models.

      (3) Clear immunological readouts.

      (4) Innovative combination of senescence biology and biomaterials.

      Weaknesses:

      (1) Mechanistic conclusions rely heavily on correlation.

      (2) Lack of functional immune cell depletion studies.

      (3) Limited exploration of long-term safety and antigenic specificity.

      Major Critiques:

      (1) The authors emphasize the expansion and activation of cDC1 as a key mechanism linking innate and adaptive immunity, yet they do not directly test whether cDC1 is required for the observed CD8⁺ T-cell responses and tumor control.

      The authors should perform experiments using Batf3-deficient mice or any other cDC1-depletion strategies to provide important mechanistic validation. If such experiments are not feasible, this limitation should be more clearly acknowledged and discussed.

      (2) The authors note that senescence may generate neoantigens distinct from those present in proliferating tumor cells, but the extent to which STC-induced immunity cross-reacts with non-senescent tumor cells is not fully addressed. While it is appreciated that tumor challenge experiments are included, the authors should perform a more explicit analysis of antigenic overlap, which would strengthen the translational relevance of the approach. For example, they could compare senescence induced by different stimuli or directly assess immune recognition of non-senescent tumor targets, which would help clarify whether the vaccine primarily exploits senescence-specific antigens or broadly shared tumor antigens.

      (3) Hydrogel encapsulation clearly extends STC persistence in vivo; however, the study provides limited information on the eventual clearance of these cells and the potential implications of prolonged SASP exposure. Given general concerns regarding chronic inflammation associated with senescent cells, additional discussion of long-term local and systemic responses would be helpful. If extended safety analyses are beyond the scope of the current study, the authors should acknowledge the limitation.

      (4) The immunological effects are attributed to COX2/PGE2 inhibition, but it remains unclear whether these effects are specific to celecoxib or could reflect formulation-dependent or off-target mechanisms. The authors may perform additional experiments employing an alternative COX2 inhibitor, genetic COX2 suppression, or PGE2 rescue, which could further support the specificity of the COX2/PGE2-dependent mechanism.

    1. eLife Assessment

      This important work describes systematic computational and experimental approaches to turn a moderately stable α-helical bundle into a very stable fold. The authors advance our understanding of α-helix stabilization providing a convenient framework that has general implications for the protein design field. The main claims have convincing support through a sound methodology, with strong specific conclusions for designing mechanically, thermally, and chemically stable α-helical bundles.

    2. Reviewer #1 (Public review):

      Summary:

      In the work from Qiu et al., a workflow aimed at obtaining the stabilization of a simple small protein against mechanical and chemical stressors is presented.

      Strengths:

      The workflow makes use of state-of-the-art AI-driven structure generation and couples it with more classical computational and experimental characterizations in order to measure its efficacy.

      The work is well presented and results are thorough and convincing.

      The Methods description is quite precise, and some important details were added during review.

      Weaknesses:

      The pulling velocity is quite high, but in accordance with this observation, the results were only used for comparative analyses.

      Following the review process the authors have shown that the minimum distance between each protein from its periodic images was consistently above 1 nm, yet towards the end of some simulations the value crosses the non-bonded interaction cut-off distance.

      Comments on revisions:

      The authors did a good job in addressing the reviews.

    3. Reviewer #2 (Public review):

      Summary:

      Qiu, Jun et al. developed and validated a computational pipeline aimed at stabilizing α-helical bundles into very stable folds. The computational pipeline is a hierarchical computational methodology tasked to generate and filter a pool of candidates, ultimately producing a manageable number of high-confidence candidates for experimental evaluation. The pipeline is split into two stages. In stage I, a large pool of candidate designs is generated by RFdiffusion and ProteinMPNN, then filtered down by a series of filters (hydropathy score, foldability assessed by ESMFold and AlphaFold). The final set is chosen by running a series of steered MD simulations. This stage reached unfolding forces above 100 pN. In stage II, targeted tweaks are introduced - such as salt bridges and metal ion coordination - to further enhance the stability of the α-helical bundle. The constructs undergo validation through a series of biophysical experiments. Thermal stability is assessed by CD, chemical stability by chemical denaturation, and mechanical stability by AFM.
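The stage I funnel summarized above (cheap filters first, expensive steered MD ranking last) can be sketched roughly as follows. This is only an illustrative outline, not the authors' pipeline: the `Candidate` fields, the thresholds, the direction of the hydropathy cut, and all scores are hypothetical, and the steered MD peak force is assumed to be precomputed here, whereas in practice it would be simulated only for designs that survive the earlier filters.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    hydropathy: float         # hypothetical hydropathy score (lower assumed better)
    plddt: float              # hypothetical structure-prediction confidence, 0-100
    smd_peak_force_pn: float  # hypothetical peak unfolding force from steered MD, pN


def funnel(candidates, max_hydropathy, min_plddt, top_n):
    """Apply cheap filters first, then rank survivors by steered MD peak force."""
    # Filter 1: sequence-level hydropathy cut (threshold is a placeholder).
    passed = [c for c in candidates if c.hydropathy <= max_hydropathy]
    # Filter 2: foldability proxy from a structure predictor (placeholder cut).
    passed = [c for c in passed if c.plddt >= min_plddt]
    # Final step: keep the top_n designs with the highest peak unfolding force.
    ranked = sorted(passed, key=lambda c: c.smd_peak_force_pn, reverse=True)
    return ranked[:top_n]


# Hypothetical candidate pool.
candidates = [
    Candidate("design_a", 0.2, 91.0, 140.0),
    Candidate("design_b", 0.9, 95.0, 180.0),  # removed by the hydropathy filter
    Candidate("design_c", 0.1, 62.0, 160.0),  # removed by the foldability filter
    Candidate("design_d", 0.3, 88.0, 120.0),
    Candidate("design_e", 0.0, 93.0, 155.0),
]
selected = funnel(candidates, max_hydropathy=0.5, min_plddt=80.0, top_n=2)
print([c.name for c in selected])
```

Ordering the filters from cheapest to most expensive is what keeps such a funnel tractable, which is also why the cost of the final steered MD step is raised as a scaling concern in the weaknesses below.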

      Strengths:

      A hierarchical computational approach that begins with high-throughput generation of candidates, followed by a series of filters based on specific goal-oriented constraints, is a powerful approach for a rapid exploration of the sequence space. This type of approach breaks down the multi-objective optimization into manageable chunks and has been successfully applied for protein design purposes (e.g., the design of protein binders). Here, the authors nicely demonstrate how this design strategy can be applied to successfully redesign a moderately stable α-helical bundle into an ultrastable fold. This approach is highly modular, allowing the filtering methods to be easily swapped based on the specific optimization goals or the desired level of filtering.

      Weaknesses:

      Assessing the change in stability relative to the WT α-helical bundle is challenging because an additional helix has been introduced, resulting in a comparison between a three-helix bundle and a four-helix bundle. Consequently, the appropriate reference point for comparison is unclear. A more direct and informative approach would have been to redesign the sequence of the original α-helical bundle of the human spectrin repeat R15, allowing for a more straightforward stability comparison.

      The three constructs chosen are 60-70% identical to each other, either suggesting over-constrained optimization of the sequence, or a physical constraint inherent to designing ultrastable α-helical bundles. It would be interesting to explore whether choosing a different combination of filters would enable ultrastable α-helical bundle constructs with a more varied sequence content.

      While the use of steered MD is an elegant approach to picking the top N most stable designs, its computational cost may become prohibitive as the number of designs increases or as the protein size grows, especially since it requires simulating a water box that can accommodate a fully denatured protein.

      Comments on revisions:

      The authors have done a good job of addressing the comments.

    4. Reviewer #3 (Public review):

      Summary:

      Qiu et al. present a hierarchical framework that combines AI and molecular dynamics simulation to design α-helical proteins with enhanced thermal, chemical, and mechanical stability. Strategic chemical modification, by incorporating an additional α-helix, site-specific salt bridges, and metal coordination, further enhanced the stability. The experimental validation using single-molecule force spectroscopy and CD melting measurements provides fundamental physicochemical insights into the stabilization of α-helices. Together with the group's prior work on super-stable β strands (https://www.nature.com/articles/s41557-025-01998-3), this research provides a comprehensive toolkit for protein stabilization. This framework has broad implications for designing stable proteins capable of functioning under extreme conditions.

      Strengths:

      The study represents a complete framework for stabilizing the fundamental protein elements, α-helices. A key strength of this work is the integration of AI tools with chemical knowledge of protein stability.

      The experimental validation in this study is exceptional. The single-molecule AFM analysis provided a high-resolution look at the energy landscape of these designed scaffolds. This approach allows for the direct observation of mechanical unfolding forces (exceeding 200 pN) and the precise contribution of individual chemical modifications to global stability. These measurements offer new, fundamental insights into the physicochemical principles that govern α-helix stabilization.

      Weaknesses:

      (1) While the initial manuscript lacked a detailed explanation for the stabilizing effect of the additional helix, the revised version now includes a clear structural basis for this improvement. The authors successfully attribute the increased unfolding force threshold to the reinforcement of the hydrophobic core and enhanced cooperative interactions, supported by relevant literature correlations between helix bundle size and stability.

      (2) The authors analyzed both thermal stability and mechanical stability. It would be helpful for the authors to discuss the relationship between these two parameters in the context of their design, since thermal melting probes equilibrium stability (ΔG), whereas mechanical stability probes the unfolding energy barriers along the pulling coordinate. While the integrative design approach successfully improved both stability types, a deeper exploration of how the specific structural modifications influence the unfolding energy barrier relative to the overall equilibrium stability would further strengthen the mechanistic impact of the work.

      (3) While the current study demonstrates a dramatic increase in global stability, the analysis focuses almost exclusively on the unfolding (melting) process. However, thermodynamic stability is a function of both folding (kf) and unfolding (ku) rates. The authors have clarified that the observed ultrastability likely originates from a significantly reduced unfolding rate, a hypothesis consistent with the unfolding force. Direct measurements of the kinetics would provide deeper insights.

      (4) The authors chose the spectrin repeat R15 as the starting scaffold for their design. R15 is a well-established model known for its "ultra-fast" folding kinetics, with folding rates (kf ~ 10<sup>5</sup> s<sup>–1</sup>) nearly three orders of magnitude faster than those of its homologues such as R17 (Scott et al., Journal of Molecular Biology 344.1 (2004): 195-205). Measuring the folding rates of the newly designed proteins would provide additional insights into the design.
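The relationship between rates and equilibrium stability raised in points (2) to (4) can be made concrete with a short sketch. It assumes an idealized two-state folder, for which the unfolding free energy is ΔG = RT ln(kf/ku); the rate values below are hypothetical and are not measurements from the manuscript.

```python
import math

R_KJ_PER_MOL_K = 8.314462618e-3  # gas constant in kJ/(mol*K)


def unfolding_free_energy(k_fold, k_unfold, temperature_k=298.15):
    """Two-state unfolding free energy, RT * ln(kf / ku), in kJ/mol."""
    return R_KJ_PER_MOL_K * temperature_k * math.log(k_fold / k_unfold)


# Two hypothetical proteins with very different kinetics (rates in 1/s).
fast_folder = unfolding_free_energy(k_fold=1e5, k_unfold=1.0)
slow_unfolder = unfolding_free_energy(k_fold=1e2, k_unfold=1e-3)
print(f"fast folder:   {fast_folder:.2f} kJ/mol")
print(f"slow unfolder: {slow_unfolder:.2f} kJ/mol")

# Effect of a tenfold slower unfolding rate at fixed folding rate.
gain = unfolding_free_energy(1e5, 0.1) - unfolding_free_energy(1e5, 1.0)
print(f"gain from 10x slower unfolding: {gain:.2f} kJ/mol")
```

Under this assumption, the two hypothetical proteins have the same equilibrium stability even though their unfolding rates differ a thousandfold, which is why equilibrium measurements alone cannot show whether a stability gain comes from faster folding or slower unfolding.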

      Comments on revisions:

      I think the authors have addressed the comments.

    5. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      In the work from Qiu et al., a workflow aimed at obtaining the stabilization of a simple small protein against mechanical and chemical stressors is presented.

      Strengths:

      The workflow makes use of state-of-the-art AI-driven structure generation and couples it with more classical computational and experimental characterizations in order to measure its efficacy. The work is well presented, and the results are thorough and convincing.

      We are grateful to this reviewer for his/her thoughtful assessment and supportive feedback. In response, we have addressed each comment and incorporated the necessary revisions into the manuscript.

      Weaknesses:

      I will comment mostly on the MD results due to my expertise.

      The Methods description is quite precise, but is missing some important details:

      (1) Version of GROMACS used.

      We used GROMACS version 2023.2 (single-precision). All subsequent MD simulation procedures mentioned below have been consolidated and described in detail in the Supporting Information (SI).

      (2) The barostat used.

      Pressure coupling was applied using the C-rescale barostat (τ<sub>p</sub> = 5.0 ps, ref<sub>p</sub> = 1.0 bar).

      (3) pH at which the system is simulated.

      No explicit pH was defined during system construction. Proteins were modeled using standard protonation states as assigned by GROMACS preprocessing tools, corresponding to physiological, near-neutral pH (~ 7.0).

      (4) The pulling is quite fast (but maybe it is not a problem)

      The relatively high pulling velocity (1 nm/ns) was selected to enable efficient screening across a large number of designed proteins (211 candidates), while maintaining reasonable computational cost/time. Given the intrinsic orders-of-magnitude difference between simulation and experimental pulling rates, SMD results were used as a comparative screening tool, rather than for direct quantitative comparison with AFM data.

      (5) What was the value for the harmonic restraint potential? 1000 is mentioned for the pulling potential, but it is not clear if the same value is used for the restraint, too, during pulling.

      All positional restraints used in the simulations, including those applied during equilibration as well as the harmonic restraint on the N-terminus and the pulling umbrella restraint during SMD, employed the same force constant (k = 1000 kJ·mol<sup>–1</sup>·nm<sup>–2</sup>). We have clarified this point in the revised Methods section.

      (6) The box dimensions.

      Rectangular simulation boxes were used throughout. For equilibrium MD simulations, the box dimensions in each direction were set based on the maximum extent of the protein along that axis, with a minimum distance of 1.2 nm between the protein surface and the box boundary on all sides. For SMD simulations, the same box dimensions were applied in the x and y directions. Along the pulling (z) direction, the box length was extended to accommodate the theoretical stretching length, defined as the initial N–C terminal distance plus 0.36 nm per stretched residue, while maintaining a 1.2 nm buffer at both ends (2.4 nm total). These details have now been clarified in the revised Supporting Information.
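
      The box-sizing rule described above can be written out as a short calculation. The Python sketch below is illustrative only: the function name and the example inputs (a 5.0 nm initial N–C distance and 100 stretched residues) are assumptions chosen for demonstration, not values from the study. Only the 0.36 nm rise per residue and the 1.2 nm buffer come from the text.

```python
# Sketch of the SMD box-sizing rule; names and example inputs are hypothetical.

RISE_PER_RESIDUE_NM = 0.36  # extension per stretched residue (from the text)
BUFFER_PER_SIDE_NM = 1.2    # protein-to-boundary margin on each side (from the text)


def smd_box_length_z(initial_nc_distance_nm, n_stretched_residues):
    """Return the box length along the pulling (z) axis, in nm."""
    if initial_nc_distance_nm < 0 or n_stretched_residues < 0:
        raise ValueError("distance and residue count must be non-negative")
    # Theoretical stretching length: initial N-C distance plus 0.36 nm per residue.
    stretched = initial_nc_distance_nm + RISE_PER_RESIDUE_NM * n_stretched_residues
    # A 1.2 nm buffer is kept at both ends (2.4 nm in total).
    return stretched + 2 * BUFFER_PER_SIDE_NM


# Placeholder inputs, not taken from the manuscript.
print(f"{smd_box_length_z(5.0, 100):.1f} nm")
```

      With these placeholder inputs the z length is 5.0 + 36.0 + 2.4 nm, which shows how quickly the box grows with the number of stretched residues.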

      From this last point, a possible criticism arises: Do the unfolded proteins really still stay far enough away from themselves to not influence the result?

      We analyzed the minimum atomic distance between each protein and its periodic images to assess potential artifacts from periodic boundary conditions. For all simulation stages used in screening and statistical analysis, the minimum protein–image separation remained above 1.0 nm for the majority of the simulation time, exceeding the nonbonded interaction cutoff and minimizing cross-boundary interactions. As shown in Author response image 1 for SpecAI89 (left), this separation during SMD simulations is consistently well above the threshold, indicating that the chosen box dimensions are appropriate. In the very late stages of annealing MD, highly unstable proteins may exhibit large conformational fluctuations and transient boundary proximity (right); however, these regimes are associated with large RMSD deviations and are excluded from analysis. Notably, the mechanically relevant unfolding events occur near the center of the simulation box and proceed along the pulling axis in SMD simulations, making boundary effects unlikely to influence the unfolding process or the relative mechanostability ranking.

      Author response image 1.

      Analysis of the minimum atomic distance between the protein and its periodic images under periodic boundary conditions. Left: SpecAI89 during SMD simulations, showing that the minimum protein–image distance remains above 1.0 nm for the majority of the simulation time. Right: WT during AMD simulations, where transient proximity to the periodic boundary is observed at very late stages due to large conformational fluctuations.
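
      For readers unfamiliar with this check, the quantity plotted above can be sketched in a few lines of Python. This is a brute-force illustration for a rectangular box, not the analysis tool used for the figure. The function name and the two-atom coordinates are hypothetical, and only the 26 neighbouring cells are searched, which assumes the molecule is shorter than the box along every axis.

```python
import math
from itertools import product


def min_periodic_image_distance(coords, box):
    """Smallest distance between any atom and any periodic image of any atom.

    coords: iterable of (x, y, z) positions; box: (lx, ly, lz) edge lengths
    of a rectangular cell, in the same units as the coordinates.
    """
    if any(edge <= 0 for edge in box):
        raise ValueError("box edges must be positive")
    coords = list(coords)
    # The 26 neighbouring cells; (0, 0, 0) is the central cell and is skipped.
    shifts = [s for s in product((-1, 0, 1), repeat=3) if s != (0, 0, 0)]
    best = math.inf
    for atom in coords:
        for other in coords:
            for shift in shifts:
                image = tuple(c + s * edge for c, s, edge in zip(other, shift, box))
                best = min(best, math.dist(atom, image))
    return best


# Placeholder two-atom "protein" in a 5 x 5 x 5 nm box (not study data).
toy_coords = [(1.0, 1.0, 1.0), (3.0, 1.0, 1.0)]
toy_box = (5.0, 5.0, 5.0)
print(min_periodic_image_distance(toy_coords, toy_box))
```

      Evaluating this per frame and comparing the result with the nonbonded cutoff gives the kind of time series summarized in the image.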

      Additionally, no time series are shown for the equilibration phases (e.g., RMSD evolution over time), which would empower the reader to judge the equilibration of the system before either steered MD or annealing MD is performed.

      We thank the reviewer for this suggestion. To assess equilibration, we analyzed the backbone RMSD evolution during the equilibration phase. Using SpecAI89 as a representative example (Author response image 2), the protein backbone RMSD converges rapidly and reaches a stable plateau within approximately 5 ps. The subsequent 125 ps equilibration period therefore sufficiently demonstrates that the system is well equilibrated prior to both steered MD and annealing MD simulations.

      Author response image 2.

      The backbone RMSD of SpecAI89 over time during simulation

      Reviewer #1 (Recommendations for the authors):

      (1) In Figure S2, only one copy (or the average of the three copies; it is not clear from the caption) is shown; it would be better to show the individual traces for each repeat. Additionally, only the plot for the forces is shown, and not, similarly to the AMD, the RMSD plot. This could be a stylistic choice, but it just reports on how much force was applied and not on how the protein responded to the force. Moreover, horizontal lines at the maximum value reached by the force could be added in order to directly see the difference in force applied, since it is then remarked on.

      Figure S2 originally shows a representative single SMD trajectory, as the force–extension peak positions vary between independent simulations and averaging the force traces would obscure the characteristic force peaks. In the revised Supplementary Information, we have now added the force–extension traces from the other two independent SMD repeats for each construct (New Figure S2). In addition, horizontal lines indicating the maximum force reached in each trajectory have been included to facilitate direct comparison of force differences between designs.

      (2) In Figure S3 the plots have different y-axes. Maybe it could be valuable to modify it so that in figures b, c, and d the spectrin result is in the background (perhaps in gray) so that the y-axis is not changed to retain the information included in this plot, but one could still compare directly to the spectrin result. With a 0 to 1 nm y-axis, part of the spectrin run will be hidden, but in any case, plot a can be used to see the full behavior. Similarly to S2, the repeats (if any) could be shown.

      We have revised Figure S3 as suggested. The y-axis is now unified to 0–1.2 nm across all panels. For panels b–d, the natural spectrin trajectory is displayed in light gray in the background for direct comparison. Additionally, three independent MD replicates are now presented for each construct to demonstrate reproducibility.

      Finally, minor remarks that could nevertheless improve the paper:

      (3) In Figure S7, a bimodal distribution model for the number of events could be used to fit the data better.

      We thank the reviewer for the detailed suggestion. Following this advice, we explored a bimodal Gaussian distribution model for fitting the force-event data in Figure S7. Indeed, a bimodal fit describes the data in Figure S7, panel f, better (as shown in Author response image 3). The two peaks are centered at F<sub>1</sub> = 190 ± 4 pN and F<sub>2</sub> = 380 ± 6 pN. The force of the first, major peak is the same as the previously fitted value. The second peak lies at twice that force, which we speculate may arise from two molecules being stretched simultaneously, although the cause is unknown. Given the small number of events in the second peak and the unchanged force of the first peak (190 pN), we have decided not to change the unfolding force value reported in the manuscript. We thank the reviewer for this insightful comment.

      Author response image 3.

      The bimodal fit for the unfolding force of SpecAI88-49E102K-6H149H shows the same 190 pN unfolding force for the first peak as the previous fit.

      (4) The colors in the video are not very intuitive, as the spectrin is shown initially in light blue, but becomes grey in the variants, where light blue is reserved for the additional helix. A counter of elapsed time and/or force/temperature applied could help the readers orient. Maybe it could be useful to produce a video with spectrin and the three variants all shown together?

      We thank the reviewer for this comment. The videos have been revised accordingly to improve clarity and consistency. In all cases, the original protein scaffold is now shown in gray, while the additional helix in the designed variants is highlighted in blue. Real-time annotations have been added to aid interpretation: the instantaneous temperature is displayed during AMD simulations, and time is shown during SMD simulations. In addition, for ease of comparison, the AMD and SMD results of all four proteins are each compiled into a single combined video, allowing their behaviors to be viewed side by side.

      Reviewer #2 (Public review):

      Qiu, Jun et al. developed and validated a computational pipeline aimed at stabilizing α-helical bundles into very stable folds. The computational pipeline is a hierarchical computational methodology tasked to generate and filter a pool of candidates, ultimately producing a manageable number of high-confidence candidates for experimental evaluation. The pipeline is split into two stages. In stage I, a large pool of candidate designs is generated by RFdiffusion and ProteinMPNN, filtered down by a series of filters (hydropathy score, foldability assessed by ESMFold and AlphaFold). The final set is chosen by running a series of steered MD simulations. This stage reached unfolding forces above 100 pN. In stage II, targeted tweaks are introduced - such as salt bridges and metal ion coordination - to further enhance the stability of the α-helical bundle. The constructs undergo validation through a series of biophysical experiments. Thermal stability is assessed by CD, chemical stability by chemical denaturation, and mechanical stability by AFM.

      Strengths:

      A hierarchical computational approach that begins with high-throughput generation of candidates, followed by a series of filters based on specific goal-oriented constraints, is a powerful approach for a rapid exploration of the sequence space. This type of approach breaks down the multi-objective optimization into manageable chunks and has been successfully applied for protein design purposes (e.g., the design of protein binders). Here, the authors nicely demonstrate how this design strategy can be applied to successfully redesign a moderately stable α-helical bundle into an ultrastable fold. This approach is highly modular, allowing the filtering methods to be easily swapped based on the specific optimization goals or the desired level of filtering.

      We are thankful for the reviewer’s diligent evaluation and positive remarks. His/her concluding remarks, which encourage our future work at the intersection of AI-protein design and AFM-SMFS, are especially appreciated. All comments have been incorporated into our revisions.

      Weaknesses:

      Assessing the change in stability relative to the WT α-helical bundle is challenging because an additional helix has been introduced, resulting in a comparison between a three-helix bundle and a four-helix bundle. Consequently, the appropriate reference point for comparison is unclear. A more direct and informative approach would have been to redesign the original α-helical bundle of the human spectrin repeat R15, allowing for a more straightforward stability comparison.

      This is an insightful comment. Indeed, a direct comparison between three-helix bundles of the same topology would be the most straightforward, with a clear reference point. We will take this advice and pursue it in future work.

      In our case, a substantial fraction of the hydrophobic region is relatively shallow and partially solvent-exposed in the wild-type R15 α-helical bundle. So, the added fourth helix provides a new hydrophobic packing interface, increasing core burial, packing density, and strengthening the internal load-bearing network. Consistent with this design rationale, rSASA analysis shows that the designed proteins exhibit a higher degree of hydrophobic core burial compared to the wild-type R15. Specifically, the fraction of residues with rSASA < 0.2 exceeds 30% in the designs, compared to 23% in the natural spectrin repeat.
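
      The burial metric quoted above (the fraction of residues with rSASA < 0.2) is simple to compute once per-residue rSASA values are available. The following Python sketch shows the calculation. The function name and the ten example values are placeholders invented for illustration and are not data from the study.

```python
def buried_fraction(rsasa_values, threshold=0.2):
    """Fraction of residues whose relative SASA is strictly below `threshold`."""
    values = list(rsasa_values)
    if not values:
        raise ValueError("need at least one residue")
    buried = sum(1 for value in values if value < threshold)
    return buried / len(values)


# Placeholder per-residue rSASA values (NOT measurements from the manuscript).
example_rsasa = [0.05, 0.10, 0.19, 0.20, 0.35, 0.50, 0.62, 0.80, 0.15, 0.90]
print(f"buried fraction: {buried_fraction(example_rsasa):.0%}")
```

      In practice the input would be averaged over equilibrated MD frames, so the reported percentage depends on which frames and which reference maximum SASA values are used.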

      While the authors have shown experimentally that stage II constructs have increased the mechanical stability by AFM, they did not show that these same constructs have increased the thermal and chemical stabilities. Since the effects of salt bridges on stability are highly context dependent (orientation, local environment, exposed vs buried, etc.), it is difficult to assess the magnitude of the effect that this change had on other types of stabilities.

      We agree that the effects of salt bridges are highly context-dependent and that different dimensions of stability do not always correlate. Following your suggestion, we evaluated the thermal and chemical stabilities of the Stage II constructs. The experimental results (now added as Figure S9) show that the Stage II designs retain high thermal stability and resistance to chemical denaturation, though to different extents: the thermal stability remains as high as that of the Stage I designs, whereas the resistance to chemical denaturation is slightly reduced. We have added this result to the manuscript accordingly.

      The three constructs chosen are 60-70% identical to each other, either suggesting overconstrained optimization of the sequence or a physical constraint inherent to designing ultrastable α-helical bundles. It would be interesting to explore these possible design principles further.

      Yes, the observed sequence convergence likely arises from a combination of intrinsic physical constraints of the protein architecture and the applied design and screening criteria. In particular, the tightly packed hydrophobic core imposes strong constraints on side-chain size, packing complementarity, and the alignment of heptad-like motifs reminiscent of coiled-coil organization, which collectively reduce the accessible sequence space. In addition, the strong selection pressure imposed by foldability and stability filters further promotes convergence toward similar solutions. We agree with the reviewer that this represents an important direction for future work.

      While the use of steered MD is an elegant approach to picking the top N most stable designs, its computational cost may become prohibitive as the number of designs increases or as the protein size grows, especially since it requires simulating a water box that can accommodate a fully denatured protein.

      Yes, steered MD can become computationally expensive, particularly as the number of designs increases or as protein size grows. Considering the vast pool created by AI, SMD in this work was applied to a relatively small, high-confidence subset of candidates after multiple rounds of rapid prescreening, keeping the overall computational cost manageable. In future applications, this step could be further accelerated by integrating machine-learning–based predictors to improve scalability.

      Reviewer #2 (Recommendations for the authors):

      I am not convinced that the difference in rSASA between the designs and the natural spectrin repeat is meaningful. It would be helpful to report confidence intervals for the rSASA values of the designs to clarify whether any differences are statistically robust. Even if such differences prove statistically significant, it is not clear that they are large enough to be practically meaningful.

      In our analysis, rSASA values were calculated from equilibrated MD conformations and were consistently higher for all designed proteins that passed the simulation-based screening compared to the wild-type spectrin repeat. However, rSASA was used only as a supportive structural descriptor to indicate a trend toward a more compact and better-buried hydrophobic core, rather than as a standalone or decisive metric of stability.

      Protein stability is indeed influenced by multiple factors, including hydrogen bonding, salt bridges, metal coordination, and topology-dependent load-bearing interactions, none of which are captured by rSASA alone. Therefore, we agree with the reviewer that differences in rSASA alone should not be overinterpreted as a quantitative measure of protein stability. For this reason, rSASA was not used as a ranking criterion or a predictor of stability, but only as complementary evidence consistent with the overall design rationale and with the experimentally observed stability enhancements.

      The claim "The strong agreement between computational rankings and experimental measurements validates this approach for prioritizing designs based on relative mechanostability, offering a practical pipeline to bridge the gap between in silico design and experimental validation." should be substantiated by a citation or a figure. Since the authors have the experimental AFM data and steered MD data, I suggest adding a Spearman correlation plot of the two.

      Following this comment, we examined the Spearman rank correlation between SMD-derived unfolding forces and experimentally measured AFM forces (Author response image 4). The resulting correlation was modest (ρ = 0.4, p = 0.6), which is not unexpected given (i) the large difference in force and timescales between high-speed SMD simulations and single-molecule AFM experiments, and (ii) the limited number of designs and simulation repeats available.

      Nevertheless, qualitatively, the difference between the first point, from wild-type spectrin, and the other three SpecAI designs is clear. Given the large computational cost, we performed only three simulations for each design to balance accuracy against cost and time. To avoid overinterpretation, we therefore did not include the correlation analysis in the main text and revised the manuscript to soften claims of strong agreement, emphasizing instead the qualitative and comparative role of SMD in the design pipeline.

      Author response image 4.

      Spearman correlation between SMD and AFM unfolding forces for natural spectrin and SpecAI designs. SMD force (x-axis) versus experimental AFM force (y-axis); each point represents one protein.
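
      As a reminder of how such a rank correlation is obtained, the sketch below computes Spearman's ρ in plain Python by ranking both variables and taking the Pearson correlation of the ranks. The function names are illustrative, and the four paired values are synthetic placeholders. They are not the SMD or AFM forces from the study.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_a * var_b) ** 0.5


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Synthetic placeholder pairs (one per protein); not data from the study.
x_placeholder = [1.0, 2.0, 3.0, 4.0]
y_placeholder = [10.0, 30.0, 40.0, 20.0]
print(round(spearman_rho(x_placeholder, y_placeholder), 3))
```

      With only a handful of proteins, a single swapped pair changes ρ substantially, which is one reason to treat the coefficient as qualitative.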

      Reviewer #3 (Public review):

      Summary:

      Qiu et al. present a hierarchical framework that combines AI and molecular dynamics simulation to design an α-helical protein with enhanced thermal, chemical, and mechanical stability. Strategically, chemical modification by incorporating additional α-helix, site-specific salt bridges, and metal coordination further enhanced the stability. The experimental validation using single-molecule force spectroscopy and CD melting measurements provides fundamental physical chemical insights into the stabilization of α-helices. Together with the group's prior work on super-stable β strands (https://www.nature.com/articles/s41557-025-01998-3), this research provides a comprehensive toolkit for protein stabilization. This framework has broad implications for designing stable proteins capable of functioning under extreme conditions.

      Strengths:

      The study represents a complete framework for stabilizing the fundamental protein elements, α-helices. A key strength of this work is the integration of AI tools with chemical knowledge of protein stability.

      The experimental validation in this study is exceptional. The single-molecule AFM analysis provided a high-resolution look at the energy landscape of these designed scaffolds. This approach allows for the direct observation of mechanical unfolding forces (exceeding 200 pN) and the precise contribution of individual chemical modifications to global stability. These measurements offer new, fundamental insights into the physicochemical principles that govern α-helix stabilization.

      We appreciate the positive assessment of our manuscript from this reviewer and his/her support. We have answered all the comments as follows and modified the manuscript accordingly.

      Weaknesses:

      (1) The authors report that appending an additional helix increases the overall stability of the α-helical protein. Could the author provide a more detailed structural explanation for this? Why does the mechanical stability increase as the number of helices increases? Is there a reported correlation between the number of helices (or the extent of the hydrophobic core) and the stability?

      In multi-helix bundle proteins, tight interhelical packing leads to the formation of a dense hydrophobic core, which substantially enhances overall structural stability. The introduction of an additional helix does not merely increase helix count, but expands the buried hydrophobic interface, improving packing density and cooperative side-chain interactions in the core. This, in turn, strengthens the internal load-bearing network that resists force-induced unfolding.

      From a mechanical perspective, adding a helix also increases topological interlocking among secondary-structure elements, which raises the energetic barrier for unfolding and shifts the unfolding pathway toward more cooperative rupture events, thereby increasing the unfolding force threshold. Consistent with this design principle, pioneering studies have reported a positive correlation between the number of helices (or the extent of the hydrophobic core) in helix bundles and their stability (Lim et al., Structure, 2008, 16:449; Minin et al., J. Am. Chem. Soc., 2017, 139, 16168; Bergues-Pupo et al., Phys. Chem. Chem. Phys., 2018, 20, 29105). Inspired by these works, our AI-protein design study uses the appended helix to reinforce the hydrophobic core rather than simply increasing secondary-structure content.

      (2) The author analyzed both thermal stability and mechanical stability. It would be helpful for the author to discuss the relationship between these two parameters in the context of their design, since thermal melting probes equilibrium stability (ΔG), while mechanical stability probes the unfolding energy barriers along the pulling coordinate.

      We agree this is a crucial distinction. Thermal and chemical stabilities report on the equilibrium free energy (ΔG), while mechanical stability probes the kinetic unfolding barrier (ΔG‡) along a force-dependent pathway. Their inherent difference makes concurrent improvement in all parameters a non-trivial task, which highlights the importance and success of our integrative design approach.

      (3) While the current study demonstrates a dramatic increase in global stability, the analysis focuses almost exclusively on the unfolding (melting) process. However, thermodynamic stability is a function of both folding (k<sub>f</sub>) and unfolding (k<sub>u</sub>) rates. It remains unclear whether the observed ultrastability is primarily driven by a drastic decrease in the unfolding rate (k<sub>u</sub>) or if the design also maintains or improves the folding rate (k<sub>f</sub>)?

      We agree with the reviewer that thermodynamic stability is determined by both the folding rate (k<sub>f</sub>) and the unfolding rate (k<sub>u</sub>). In the present study, we did not directly measure folding kinetics, and therefore cannot quantitatively deconvolute the respective contributions of k<sub>f</sub> and k<sub>u</sub> to the observed ultrastability. Based on the design strategy and the experimental observations, we propose that the enhanced stability primarily originates from a substantial reduction in the unfolding rate (k<sub>u</sub>), corresponding to an increased unfolding energy barrier. The reinforcement of the hydrophobic core, the introduction of stabilizing interactions such as salt bridges and metal coordination, and the additional helix that increases topological and packing constraints all raise the energetic cost of disrupting key interactions in the folded state.

      This interpretation is consistent with the high mechanical unfolding forces observed in both AFM experiments and SMD simulations. In contrast, these stabilizing features are not necessarily expected to accelerate folding and may even modestly increase folding complexity. Addressing folding kinetics explicitly would require dedicated kinetic experiments or simulations, which are beyond the scope of the present work but represent an interesting direction for future studies.
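
      The dependence of stability on the two rate constants can be made concrete under a simple two-state assumption, which is introduced here for illustration and is not claimed in the text. In that picture the unfolding free energy is ΔG = RT ln(k<sub>f</sub>/k<sub>u</sub>), so a tenfold decrease in k<sub>u</sub> at fixed k<sub>f</sub> adds RT ln 10 to the stability. The rate values below are hypothetical placeholders.

```python
import math

R_KJ_PER_MOL_K = 8.314462618e-3  # gas constant in kJ/(mol*K)


def unfolding_free_energy(kf, ku, temperature_k=298.15):
    """Two-state stability dG = R*T*ln(kf/ku) in kJ/mol (positive favors the folded state)."""
    if kf <= 0 or ku <= 0 or temperature_k <= 0:
        raise ValueError("rates and temperature must be positive")
    return R_KJ_PER_MOL_K * temperature_k * math.log(kf / ku)


# Hypothetical rate constants in 1/s, chosen only to show the scaling.
baseline = unfolding_free_energy(kf=1e3, ku=1.0)
slower_unfolding = unfolding_free_energy(kf=1e3, ku=0.1)
print(f"gain from 10x slower unfolding: {slower_unfolding - baseline:.2f} kJ/mol")
```

      The same gain would follow from a tenfold faster k<sub>f</sub>, which is why equilibrium measurements alone cannot separate the two contributions.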

      (4) The authors chose the spectrin repeat R15 as the starting scaffold for their design. R15 is a well-established model known for its "ultra-fast" folding kinetics, with folding rates (k<sub>f</sub> ~10<sup>5</sup> s<sup>–1</sup>) nearly three orders of magnitude faster than those of its homologues such as R17 (Scott et al., Journal of molecular biology 344.1 (2004): 195-205). Does the newly designed protein, with its additional fourth helix and site-specific chemical modifications, retain the exceptionally high folding rate of the parent R15?

      We did not directly measure the folding kinetics of the newly designed proteins, and therefore cannot determine whether they retain the exceptionally fast folding rate reported for the parent spectrin repeat R15. While R15 is known for its ultrafast folding behavior, the introduction of an additional fourth helix and site-specific chemical modifications, although beneficial for enhancing stability, may increase the complexity of the folding landscape and do not necessarily guarantee that the folding rate (k<sub>f</sub>) remains comparable to that of R15.

      Reviewer #3 (Recommendations for the authors):

      (1) Please clarify the Gaussian function used to fit the unfolding force distribution (Figure 3-4). In Figure S8, the Bell-Evans model is used to analyze unfolding force. The authors should explain the choice of fitting methods and ensure consistency.

      The Gaussian fitting used in Figures 3–4 is intended as a descriptive statistical analysis to summarize the unfolding force distributions and to facilitate direct comparison between different designs. This approach provides a robust estimate of the most probable unfolding force and the distribution width, without invoking a specific physical unfolding model, and is commonly used in single-molecule force spectroscopy for comparative purposes.

      In contrast, the Bell-Evans model applied in Figure S8 is a kinetic framework that explicitly accounts for force-loading-rate dependence and is used to extract mechanistic insights into the unfolding process. Therefore, the two fitting approaches serve complementary roles: Gaussian fitting for quantitative comparison and ranking of mechanostability, and Bell-Evans analysis for mechanistic interpretation. We have clarified this distinction and the rationale for using both methods in the revised Supplementary Information to ensure consistency and transparency.

      (2) The authors utilized steered MD simulation to analyze the mechanical properties via ForceGen (Ni et al., 2024, Sci. Adv. 10, eadl4000). However, the significant discrepancy between the predicted unfolding force (~600 pN) and the experimental value (~50 pN for spectrin, line 376) requires further justification. Please clarify how the accuracy of these predictions can be established. Specifically, do the MD simulations successfully capture the relative ranking or trends in stability across the different designed variants?

      We agree with the reviewer that there is a substantial discrepancy between the absolute unfolding forces predicted by SMD simulations (~600 pN) and those measured experimentally by AFM (~50 pN for spectrin). This difference primarily arises from the orders-of-magnitude mismatch in loading rates between simulations and experiments. In our SMD simulations, the pulling velocity (~10<sup>9</sup> nm/s) is several orders of magnitude higher than that used in AFM experiments (~10<sup>3</sup> nm/s), which is expected to systematically elevate the apparent unfolding force. In addition to loading-rate effects, limitations in force-field accuracy, finite system size, and restricted conformational sampling further contribute to deviations in absolute force values. As a result, the unfolding forces obtained from SMD are not intended to provide quantitative agreement with experimental measurements or absolute mechanical stability.

      Instead, SMD is employed here as a comparative screening tool to assess relative mechanostability across different designed variants under identical simulation conditions. Despite the limited number of repeats imposed by computational cost, the simulations consistently distinguish candidates with markedly different mechanical responses. Importantly, the variants identified by SMD as more mechanically stable were subsequently confirmed experimentally to exhibit enhanced mechanostability relative to the wild-type spectrin repeat. Therefore, while SMD does not yield quantitatively accurate unfolding forces, it successfully captures relative stability trends and provides a practical and effective means for prioritizing designs prior to experimental validation.
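
      The loading-rate argument above can be illustrated with the Bell-Evans expression for the most probable unfolding force, F* = (k<sub>B</sub>T/Δx) ln(rΔx/(k<sub>0</sub>k<sub>B</sub>T)), in which the force grows logarithmically with the loading rate r. The Python sketch below is a rough illustration only. The barrier distance Δx, the zero-force rate k<sub>0</sub>, and the two loading rates are hypothetical values, and a single Bell-Evans regime is assumed to hold across six decades, which may not be true for real proteins.

```python
import math

KBT_PN_NM = 4.11  # thermal energy k_B*T near 298 K, in pN*nm


def bell_evans_force(loading_rate, dx, k0, kbt=KBT_PN_NM):
    """Most probable unfolding force (pN) in the Bell-Evans model.

    loading_rate: pN/s; dx: distance to the transition state, nm;
    k0: spontaneous (zero-force) unfolding rate, 1/s.
    """
    if loading_rate <= 0 or dx <= 0 or k0 <= 0:
        raise ValueError("all parameters must be positive")
    return (kbt / dx) * math.log(loading_rate * dx / (k0 * kbt))


def force_shift(rate_ratio, dx, kbt=KBT_PN_NM):
    """Change in most probable force when the loading rate is scaled by rate_ratio."""
    if rate_ratio <= 0 or dx <= 0:
        raise ValueError("rate_ratio and dx must be positive")
    return (kbt / dx) * math.log(rate_ratio)


# Hypothetical parameters, for illustration only (not fitted values).
DX_NM = 0.2
K0_PER_S = 1e-3
slow = bell_evans_force(1e3, DX_NM, K0_PER_S)  # AFM-like loading rate
fast = bell_evans_force(1e9, DX_NM, K0_PER_S)  # SMD-like loading rate
print(f"force increase over six decades: {fast - slow:.0f} pN")
```

      Because the shift depends only on the ratio of loading rates and on Δx, it is the same for every design sharing a similar Δx, which is consistent with using SMD forces for relative ranking rather than absolute comparison.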

    1. eLife Assessment

      This is an important study showing that movement vigor is not solely an individual property but emerges through interaction when two people are physically linked. The evidence is convincing, supported by a well-controlled experimental design and modeling that closely match the observed behavior. While the authors provided a helpful comparison of several candidate models of human-human interaction dynamics, the statistical power remains limited.

    2. Reviewer #1 (Public review):

      Summary:

      The authors present a novel investigation of movement vigor of individuals completing a synchronous extension-flexion task. Participants were placed into groups of two (so-called "dyads") and asked to complete shared movements (connected via a virtual loaded spring) to targets placed at varying amplitudes. The authors attempted to quantify what, if any, adjustments in movement vigor individual participants made during the dyadic movements, given the combined or co-dependent nature of the task. This is a novel, timely question of interest within the broader field of human sensorimotor control.

      Participants from each dyad were labeled as "slow" (low vigor) or "fast" (high vigor), and their respective contributions to the combined movement metrics assessed. The authors presented four candidate models for dyad interactions: (a) independent motor plans (i.e., co-activity hypothesis), (b) individual-led motor plans (i.e., leader-follower hypothesis), (c) generalization to a weighted average motor plan (i.e., weighted adaptation hypothesis), and (d) an uncertainty-based model of dynamic partner-partner interaction (i.e., interactive adaptation hypothesis). The final model allowed for dynamic changes in individual motor plans (and therefore, movement vigor) based on partner-partner interactions and observations. After detailed observations of interaction torque and movement duration (or, vigor), the authors concluded that the interactive adaptation model provided the best explanation of human-human interaction during self-paced dyadic movements.

      Strengths:

      The experimental setup (simultaneous wrist extension-flexion movements) has been thoroughly vetted. The task was designed particularly well, with adequate block pseudo-randomization to ensure general validity of the results. The analyses of torque interaction, movement kinematics, and vigor are sound, as are the statistical measures used to assess significance. The authors structured the work via a helpful comparison of several candidate models of human-human interaction dynamics, and how well said models explained variance in the vigor of solo and combined movements.

      The authors adequately addressed several concerns that I raised in my initial review of the work, including clarity regarding analyses of movement vigor and inclusion of additional analyses of reaction time. The results are supported by both parametric and non-parametric statistical methods.

      The research question is timely and extends current neuroscientific understanding of sensorimotor control, particularly in social contexts. This work answers several new, important questions about control of vigor during volitional movements, and in doing so it motivates future research into the topic.

      Weaknesses:

      My chief concern about the study is the relatively low number of dyad data points (n=10). The authors recruited 20 participants, but the primary conclusions are based on dyad-specific interactions (i.e., analyses of "fast" vs "slow" participants in each pair). However, it is important to note that most of the effects upon which the conclusions rest are associated with relatively large effect sizes.

    3. Reviewer #2 (Public review):

      Summary:

      This study examines how individual movement vigor is integrated into a shared, dyadic vigor when two individuals are physically coupled. Participants performed wrist-reaching movements toward targets at different distances while mechanically linked via a virtual elastic band, and dyads were formed by pairing participants with different baseline vigor profiles. Under interaction conditions, movements converged to coordinated patterns that could not be explained by simple averaging, indicating that each dyad behaved as a single functional unit. Notably, under coupling, movement durations for both partners were shorter than in the solo condition, arguing against the view that each individual simply executed an independent movement plan. Furthermore, dyadic vigor was primarily predicted by the slower partner's vigor rather than by the faster partner's, suggesting that neither a leader-follower strategy nor a weighted averaging account fully explains the observed behavior. The authors propose a computational model in which both partners adapt to the emerging interaction dynamics ("interactive adaptation strategy"), providing a coherent explanation of the behavioral observations.

      Strengths:

      The study is carefully designed and addresses an important question about how individual movement vigor is integrated during joint action. The experimental paradigm allows systematic manipulation of interaction strength and partner asymmetry. The behavioral results show clear and robust patterns, particularly the shortening of movement durations under elastic coupling (KL and KH conditions) and the asymmetrical contribution of the slower partner's vigor to dyadic vigor. The computational model captures the main behavioral patterns well and provides a principled framework for interpreting dyadic vigor not as a simple combination of two independent motor plans, but as an emergent property arising from mutual adaptation. Conceptually, the study is notable in extending the notion of vigor from an individual attribute to a dyad-level construct, opening a new perspective on coordinated movement and motor decision-making.

      Weaknesses:

      The revised manuscript now clearly explains why the proposed computational model successfully accounts for the observed dyadic behavior. In particular, the mechanisms by which uncertainty associated with the slower partner and time-related costs of the faster partner jointly shape dyadic vigor are now clear. I have no further comments to add.

    4. Reviewer #3 (Public review):

      Summary:

      This study provides novel insights into how individuals regulate the speed of their movements both alone and in pairs, highlighting consistent differences in movement vigor across people and showing that these differences can adapt in dyadic contexts. The findings are significant because they reveal stable individual patterns of action that are flexible when interacting with others, and they suggest that multiple factors, beyond reward sensitivity, may contribute to these idiosyncrasies. The evidence is generally strong, supported by careful behavioral measurements and appropriate modeling.

      The authors have addressed all of my previous comments. I appreciate the clarification of abbreviations, terminology, and key concepts, the expansion of the discussion, and the adjustments to some of the statistical analyses in response to both my earlier comments and those of Reviewer 1.

    5. Author response:

      The following is the authors’ response to the original reviews.

      We thank the reviewers for their constructive and precise comments, which have helped us improve the consistency and clarity of our manuscript. Below, we provide a point-by-point response to each comment. In summary, the main changes introduced in the revised version are as follows:

      (1) We replaced all the statistical analyses with their non-parametric equivalents to ensure compliance with test assumptions and consistency of the results;

      (2) We compared the participants’ reaction times before and during connected practice, revealing a significant reduction in reaction times of both partners when connected;

      (3) We added, in the supplementary materials, a table reporting the vigor scores of each participant in each experimental condition, facilitating the assessment of individual and dyadic behaviors;

      (4) We have reviewed and refined the terminology throughout the manuscript and reduced the number of abbreviations to improve clarity.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      The authors present a novel investigation of the movement vigor of individuals completing a synchronous extension-flexion task. Participants were placed into groups of two (so-called "dyads") and asked to complete shared movements (connected via a virtual loaded spring) to targets placed at varying amplitudes. The authors attempted to quantify what, if any, adjustments in movement vigor individual participants made during the dyadic movements, given the combined or co-dependent nature of the task. This is a novel, timely question of interest within the broader field of human sensorimotor control.

      Participants from each dyad were labeled as "slow" (low vigor) or "fast" (high vigor), and their respective contributions to the combined movement metrics were assessed. The authors presented four candidate models for dyad interactions: (a) independent motor plans (i.e., co-activity hypothesis), (b) individual-led motor plans (i.e., leader-follower hypothesis), (c) generalization to a weighted average motor plan (i.e., weighted adaptation hypothesis), and (d) an uncertainty-based model of dynamic partner-partner interaction (i.e., interactive adaptation hypothesis). The final model allowed for dynamic changes in individual motor plans (and therefore, movement vigor) based on partner-partner interactions and observations. After detailed observations of interaction torque and movement duration (or vigor), the authors concluded that the interactive adaptation model provided the best explanation of human-human interaction during self-paced dyadic movements.

      Strengths:

      The experimental setup (simultaneous wrist extension-flexion movements) has been thoroughly vetted. The task was designed particularly well, with adequate block pseudo-randomization to ensure general validity of the results. The analyses of torque interaction, movement kinematics, and vigor are sound, as are the statistical measures used to assess significance. The authors structured the work via a helpful comparison of several candidate models of human-human interaction dynamics, and how well said models explained variance in the vigor of solo and combined movements. The research question is timely and extends current neuroscientific understanding of sensorimotor control, particularly in social contexts.

      We thank the reviewer for their in-depth analysis and constructive assessment of our manuscript.

      Weaknesses:

      (1) My chief concern about the study as it currently stands is the relatively low number of data points (n=10). The authors recruited 20 participants, but the primary conclusions are based on dyad-specific interactions (i.e., analyses of "fast" vs "slow" participants in each pair). Some of these analyses would benefit greatly, in terms of power, from the addition of more data points.

      We understand and appreciate the reviewer’s concern regarding the effective sample size at the dyad level (n=10). While our primary analyses focus on dyad-specific interactions, we note that the reported effects are consistent across multiple dynamic conditions and are associated with large effect sizes. To provide a conservative assessment, the Cohen’s D values reported correspond to the smallest effect size observed across the relevant statistical tests, thereby limiting the risk of false positives or overinterpretation. In addition, to ensure robustness given the sample size and distribution properties of the data, we have replaced all parametric tests with their non-parametric counterparts, as some analyses violated ANOVA assumptions. Friedman and Kruskal-Wallis tests are now used for paired and unpaired main effects, respectively, and Wilcoxon and Mann-Whitney tests for paired and unpaired post-hoc comparisons, respectively. Note that these changes did not alter the conclusions of the study.
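      As an illustration only (not the authors' analysis code), the paired main-effect test mentioned above can be sketched in plain Python. The function names `rank_row` and `friedman`, the example durations, and the choice of three conditions are all assumptions made for this sketch. The statistic omits the tie correction, and the closed-form p-value holds only for three conditions (two degrees of freedom) under the usual chi-square approximation, which is rough for small samples.

```python
import math


def rank_row(values):
    """Return 1-based ranks of `values`; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def friedman(table):
    """Friedman chi-square (no tie correction) and Kendall's W.

    `table` holds one row per subject (or dyad) and one column per condition.
    """
    n, k = len(table), len(table[0])
    rank_sums = [0.0] * k
    for row in table:
        for col, r in enumerate(rank_row(row)):
            rank_sums[col] += r
    q = 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)
    w = q / (n * (k - 1))  # Kendall's W, an effect size between 0 and 1
    return q, w


# Hypothetical movement durations (s) for 5 dyads in 3 conditions.
durations = [
    [0.95, 0.85, 0.80],
    [1.10, 0.90, 0.82],
    [0.88, 0.84, 0.79],
    [1.02, 0.91, 0.86],
    [0.87, 0.89, 0.81],
]
q, w = friedman(durations)
# With 3 conditions (df = 2) the chi-square survival function is exp(-q / 2).
p = math.exp(-q / 2)
print(f"Q = {q:.2f}, W = {w:.2f}, p = {p:.3f}")
```

      In practice a statistics package with exact or tie-corrected procedures would be preferable for a sample of ten dyads; the sketch only makes the rank-based logic of the test explicit.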

      (a) The distribution of delta-vigor (Fast group vs Slow group) is highly skewed (see Figures 3D, S6D), with over half of the dyads exhibiting delta-vigor less than 0.2 (i.e., less than 20% of unit vigor). Given the relatively low number of dyads, it would be helpful for the authors to provide explicit listings of VigorFast, VigorSlow, and VigorCombined for each of the 10 separate dyads or pairings.

      We agree with this comment. However, we note that the distribution of vigor scores within a population is typically centered around 1, with large deviations observed only for the fastest and slowest participants [1]. As a result, the distribution of ∆-vigor is inherently skewed. Correcting for this skewness would (i) require pairing participants based on their vigor, which is logistically difficult, and (ii) lead to an atypical sampling of dyads, with an overrepresentation of pairs exhibiting very large vigor differences. The distributions of vigor scores for the fast and slow groups before and after the interaction are reported in Supplementary Fig. S21. In addition, as suggested by the reviewer, we have now included Table S.1 in the supplementary materials, listing the values VigorFast, VigorSlow, and VigorCombined for each of the 10 dyads. This table provides a complete view of the evolution of participants’ vigor throughout the experiment.

      (b) The authors concluded that the interactive adaptation hypothesis provided the best summary of the combined movement dynamics in the study. If this is indeed the case, then the relative degree of difference in vigor between the fast and slow participants in a dyad should matter. How well did the interactive adaptation model explain variance in the dyads with relatively low delta-vigor (e.g., less than 0.2) vs relatively high delta-vigor?

      We initially expected the magnitude of difference in individual vigor within a dyad to play a significant role. However, our analysis did not reveal any systematic effect of ∆-vigor on either the interaction force or the resulting dyadic vigor, as shown by the LMM analysis. Importantly, the interactive adaptation hypothesis does not per se imply that the magnitude of vigor differences between the two partners should matter, only that their respective roles in selecting the adapted behavior are different. Although the model includes several free parameters, we did not attempt to fit it to individual dyads as would in principle be possible. Instead, we performed a sensitivity analysis to assess how variations in the difference in vigor between the partners influence model predictions. For this purpose, we simulated increasing values of µ and variations in the fast partner’s cost of time. In addition, we demonstrated that uncertainty in the estimated behavior of the slow partner, which is a priori specific to each individual, has a substantial impact on the optimal movement duration of the dyad. Overall, this analysis shows that the model captures the full range of qualitative trends observed in the experimental data. When applied to predict the behavior of the average dyad, the resulting movement time prediction error remains small, as detailed in the Results section.

      (2) The authors shared the results of one analysis of reaction time, showing that the reaction times of the slow partners and the fast partners did not differ during the initial passive block. Did the authors observe any changes in RT of either the slow or fast partner during the combined (primary task) blocks (KL, KH, etc.)? If the pairs of participants did indeed employ a form of interactive adaptation, then it is certainly plausible that this interaction would manifest in the initial movement planning phase (i.e., RT) in addition to the vigor and smoothness of the movements themselves.

      We thank the reviewer for this interesting question, which prompted us to extend our analysis of reaction times to the connected conditions. This additional analysis revealed a significant main effect of the condition on the reaction time for both the fast and slow groups (in both cases: W<sub>2</sub> > 0.39, p < 0.02). Post-hoc comparisons showed a significant reduction in reaction time between the initial null-field block (NF1) and the KH condition for the slow group (p = 0.03, D = 1.46), and a similar trend for the fast group (p = 0.06, D = 1.03). However, the reaction times remained comparable between the two groups, with no significant difference between them. We have incorporated these observations in the Results section (p.4, l.100–109) and expanded the Discussion (p.11, l.341–348) to address their implications for interactive adaptation in human-human and human-robot physical interactions.

      Reviewer #2 (Public review):

      Summary:

      This study examines how individual movement vigor is integrated into a shared, dyadic vigor when two individuals are physically coupled. Participants performed wrist-reaching movements toward targets at different distances while mechanically linked via a virtual elastic band, and dyads were formed by pairing participants with different baseline vigor profiles. Under interaction conditions, movements converged to coordinated patterns that could not be explained by simple averaging, indicating that each dyad behaved as a single functional unit. Notably, under coupling, movement durations for both partners were shorter than in the solo condition, arguing against the view that each individual simply executed an independent movement plan. Furthermore, dyadic vigor was primarily predicted by the slower partner’s vigor rather than by the faster partner’s, suggesting that neither a leader-follower strategy nor a weighted averaging account fully explains the observed behavior. The authors propose a computational model in which both partners adapt to the emerging interaction dynamics ("interactive adaptation strategy"), providing a coherent explanation of the behavioral observations.

      Strengths:

      The study is carefully designed and addresses an important question about how individual movement vigor is integrated during joint action. The experimental paradigm allows systematic manipulation of interaction strength and partner asymmetry. The behavioral results show clear and robust patterns, particularly the shortening of movement durations under elastic coupling (KL and KH conditions) and the asymmetrical contribution of the slower partner’s vigor to dyadic vigor. The computational model captures the main behavioral patterns well and provides a principled framework for interpreting dyadic vigor not as a simple combination of two independent motor plans, but as an emergent property arising from mutual adaptation. Conceptually, the study is notable in extending the notion of vigor from an individual attribute to a dyad-level construct, opening a new perspective on coordinated movement and motor decision-making.

      We thank the reviewer for their thorough analysis of our manuscript and their constructive feedback.

      Weaknesses:

      (1) A key conceptual issue concerns the apparent asymmetry between partners in the computational framework. While dyadic vigor is empirically better predicted by the slower partner’s vigor, the model formulation appears to emphasize the faster partner’s time-related cost and interaction forces. Although the cost function includes an uncertainty-related component associated with the slower partner, it remains unclear from the current formulation and description how dyadic vigor is formally derived from the slower partner’s control policy within the same modeling framework. This raises an important question regarding whether the model offers a symmetric account of dyadic vigor formation for both partners or whether it is effectively anchored to the faster partner’s control architecture.

      We have modified our phrasing to clarify the principles according to which the computational framework was designed (p.7, l.226–231 and p.9, l.260–264). As stated in the Results section, the model is indeed asymmetric by design, which corresponds to the different roles of the fast and slow partners exhibited in the data. In that context, the uncertainty term associated with the slow partner should be understood as an overarching constraint that conditions the strategy of the dyad, while the fast partner’s cost of time acts as a contributor to the expected dyad strategy. Conceptually and numerically, as reported in the sensitivity analysis, this asymmetry corresponds to the role of the slow partners in setting the vigor ranking among the dyads and the role of the fast partner in setting the average dyadic behavior.

      (2) A second conceptual issue concerns the interpretation of the term "motor plan." It remains unclear whether this term refers primarily to movement-related characteristics such as speed or duration, or more broadly to the underlying optimization structure that governs these variables. This distinction is theoretically important, as it determines whether the reported interaction effects should be understood as adjustments in movement characteristics or as changes in the structure of the control policy itself.

      We agree with the reviewer that this terminology required clarification. In this paper, the term “motor plan” refers to the time series of control inputs planned by the CNS, rather than solely to kinematic descriptors such as speed or duration. These planned control signals are a direct consequence of the underlying optimization structure and cost functions that govern trajectory generation. We have clarified this definition in the Introduction (p.1, l.23–24).

      Reviewer #3 (Public review):

      Strengths:

      This study provides novel insights into how individuals regulate the speed of their movements both alone and in pairs, highlighting consistent differences in movement vigor across people and showing that these differences can adapt in dyadic contexts. The findings are significant because they reveal stable individual patterns of action that are flexible when interacting with others, and they suggest that multiple factors, beyond reward sensitivity, may contribute to these idiosyncrasies. The evidence is generally strong, supported by careful behavioral measurements and appropriate modeling, though clarifying some statistical choices and including additional measures of accuracy and smoothness would further strengthen the support for the conclusions.

      Thank you for this analysis and the insightful feedback.

      Major Comments:

      (1) Given the idiosyncrasies in individual vigor, would linear mixed models (LMMs) be more appropriate than ANOVAs in some analyses (e.g., in the section "Solo session"), as they can account for random intercepts and slopes on vigor measures? Some figures (e.g., Figure 2.B and 3.E) indeed seem to show that some aspects of behaviour may present variability in slopes and intercepts across participants. In fact, I now realize that LMMs are used in the "Emergence of dyadic vigor from the partners’ individual vigor" section, so could the authors clarify why different statistical approaches were applied depending on the sections?

      We thank the reviewer for this thoughtful comment. We deliberately used different statistical approaches throughout the paper in order to address different types of questions. Note that the statistical tests were converted to their non-parametric equivalents for consistency (see answer to Reviewer 1).

      - Friedman tests were used in a limited number of cases to assess population- or group-level effects, such as differences in movement time, smoothness, or accuracy across the solo, connected, and after-effects conditions. Such tests provide a straightforward framework for these descriptive, condition-level comparisons.

      - The stability of individual and dyadic vigor scores across conditions was assessed using Pearson correlations across all condition pairs, which we consider the most direct and interpretable approach for evaluating consistency across sessions.

      - LMMs were employed to examine how dyadic vigor relates to the partners’ individual vigor measured in the solo conditions, which revealed the critical contribution of the slow partner.

      Rather than applying a single statistical framework throughout, we selected the method best suited to each question. While LMMs are well suited for modeling participant-specific variability when linking individual and dyadic measures, their systematic use in all analyses would be less intuitive and would not directly address several of the population-level comparisons central to this study.

      (2) If I understand correctly, the introduction suggests that idiosyncrasies in movement vigor may be driven by interindividual differences in reward sensitivity. However, the current task does not involve any explicit rewards, yet the authors still observe idiosyncrasies in vigor, which is interesting. Could this indicate that other factors contribute to these consistent individual differences? For example, could sensitivity to temporal costs or physical effort explain the slow versus fast subgrouping? Specifically, might individuals more sensitive to temporal costs move faster to minimize opportunity costs, and might those less sensitive to effort costs also move faster? Along the same lines, could the two subgroups (slow vs. fast) be characterized in terms of underlying computational "phenotypes," such as their sensitivities to time and effort? If this is not feasible with the current dataset, it would still be valuable to discuss whether these factors could plausibly account for the observed patterns, based on existing literature.

      We thank the reviewer for this interesting question. We first note that the notion of reward in motor control is quite broad. Although our task did not include explicit external (e.g. monetary) rewards, we assumed that participants attribute an implicit value to completing the task in accordance with the experimenter’s instructions. This assumption has been shown to be appropriate for characterising baseline behavior in previous studies [2–5].

      As discussed in the Introduction, vigor is generally understood to emerge from a tradeoff between effort, accuracy, and time. The reviewer is correct in noting that inter-individual differences in vigor may reflect differences in reward sensitivity or in its discounting [3,6], given that time and reward are intrinsically coupled. Differences in vigor may also arise from inter-individual variability in sensitivity to effort or perceived task difficulty. Because these factors are intertwined (for example, increasing accuracy through co-contraction typically incurs greater effort [7]), it is challenging to disentangle their respective contributions based solely on behavioral data.

      In the present study, our inverse optimal control procedure to identify the cost of time (and thus predict individuals’ vigor) relies on a predefined effort-accuracy tradeoff under fixed final time across multiple movement amplitudes [8]. As a result, the model does not allow us to independently estimate individual sensitivities to effort, accuracy, and time. Such characterization of computational "phenotypes" would likely require experimental paradigms in which each of these factors is systematically manipulated while the others are held constant, which is beyond the scope of the current dataset. In practice, the main value of behavioral modeling lies in revealing the relative weighting of these criteria by the CNS during motor planning [5]. We have expanded the Discussion to clarify these limitations and considerations (see Discussion p.12, l.396–401 & l.407–412).

      Finally, we chose not to emphasize these broader issues in the present manuscript because (i) they are peripheral to our primary research question on how individual vigor influences human-human interaction, and (ii) although we do not yet have definitive and consensual answers, they have been addressed in multiple studies reviewed elsewhere [9,10].

      (3) The observation that dyads did not lose accuracy or smoothness despite changes in vigor is interesting and suggests a shift in the speed-accuracy tradeoff. Could the authors include accuracy and smoothness measures in the main figures rather than only in supplementary materials? I think it would make the manuscript more complete.

      We also find that the preservation of accuracy and smoothness despite changes in vigor is an interesting result, and we therefore chose to report these measures in the Supplementary Materials. However, we believe it is preferable not to include them in the main figures for the following reasons:

      - We avoid framing our results in terms of a speed-accuracy trade-off, as Fitts’ work was initially designed to study fast movements [11], whereas our work focuses on self-paced movements. As outlined in the Introduction, vigor is more appropriately interpreted as reflecting a tradeoff between effort (related to movement speed), accuracy, and time. From this perspective, the reported changes of vigor already capture a shift in the underlying trade-off selected by the CNS, using a framework better suited to our experimental paradigm.

      - The manuscript is technically dense and reports multiple analyses that are essential to establish (i) the existence and definition of dyadic vigor, and (ii) how it emerges from interaction between partners. Although the observed preservation of accuracy and improvements in smoothness are informative, they are not central to these two primary questions and would risk diverting attention from the core contributions of the paper. In addition, accuracy is not a feature predicted by our deterministic modeling, and extensions would be needed to capture these aspects. Here we only attempted to replicate average behaviors.

      (4) It is a bit unclear to me whether the variance assumptions for ANOVAs were checked, for instance, in Figure 3H.

      We thank the reviewer for this comment, which prompted us to verify the assumptions underlying our ANOVAs. We found that a few distributions in the original analysis, as well as in some of the new tests, did not meet these assumptions. To ensure consistency, all statistical analyses have now been replaced with non-parametric tests: Friedman and Kruskal-Wallis tests for paired and unpaired main effects, Wilcoxon and Mann-Whitney tests for paired and unpaired post-hoc comparisons. The updated results do not change any of the conclusions. The only minor change concerns accuracy, which appeared slightly improved in a restricted number of connected conditions and now appears mostly unaffected.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      Minor points:

      (1) Lines 146-147. The authors state, "Whereas the fast partners maintained a similar duration". Figures S6H,I suggest that fast partners made slower movements during the paired task relative to the solo task, not movements with a similar duration.

      We agree that Fig. S.6H,I suggest slightly slower movements for the fast partners, though the difference is not significant. We have modified the sentence to be less assertive than in the previous version (see p.6, l.155).

      (2) In the Discussion (Lines 318-319), the authors state that their findings confirm and extend the "benefits of dyadic control in collaborative actions". What benefits are they referring to here, relative to individual control? It would be helpful if the authors would elaborate on this claim.

      We have modified this sentence to clarify that the benefits of dyadic control refer to previously reported advantages over individual control, namely reduced movement time (Reed and Peshkin, 2008 [12]) and improved tracking accuracy [13,14] (see p.11, l.336–337).

      (3) On Lines 87-89, the authors reference a decomposition of variance of vigor scores across the NF1, VL, and VH conditions; however, I did not see an explanation of how this decomposition was performed. The method used to estimate variance explained by inter-individual vs intra-individual differences in vigor should be outlined for the reader.

      Thank you for pointing out this missing information. We now explain in the statistical analysis section (see p.14, l.504–507) that the percentage of inter-individual variability in vigor is estimated using sum-of-squares values as estimates of inter- and intra-individual variability.
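      One common way to carry out such a sum-of-squares decomposition is sketched below, purely as an illustration; the response does not give the authors' exact formula. The function name `variance_decomposition`, the participant labels, and the vigor scores are hypothetical. The sketch splits the total sum of squares into a between-participant part (inter-individual) and a within-participant part (intra-individual, across conditions).

```python
def variance_decomposition(scores):
    """Split total variability into inter- and intra-individual percentages.

    `scores` maps each participant to a list of vigor scores, one per condition.
    Returns (percent_inter, percent_intra), which sum to 100.
    """
    all_values = [v for vals in scores.values() for v in vals]
    grand_mean = sum(all_values) / len(all_values)

    ss_between = 0.0  # spread of participant means around the grand mean
    ss_within = 0.0   # spread of each participant's scores around their own mean
    for vals in scores.values():
        mean = sum(vals) / len(vals)
        ss_between += len(vals) * (mean - grand_mean) ** 2
        ss_within += sum((v - mean) ** 2 for v in vals)

    ss_total = ss_between + ss_within
    return 100.0 * ss_between / ss_total, 100.0 * ss_within / ss_total


# Hypothetical vigor scores in three conditions (e.g. NF1, VL, VH).
vigor = {
    "p1": [0.80, 0.90, 0.85],
    "p2": [1.20, 1.10, 1.15],
    "p3": [1.00, 1.00, 1.00],
}
inter, intra = variance_decomposition(vigor)
print(f"inter-individual: {inter:.1f}%, intra-individual: {intra:.1f}%")
```

      A high inter-individual percentage in such a decomposition would indicate that participants differ from one another more than each participant varies across conditions.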

      (4) How was the absolute interaction torque for a paired movement calculated? Was it an integral of the temporal profile of torque for some portion of the combined movement? The method for calculating the absolute interaction torque needs to be specified.

      We have now clarified in the Methods (see p.14, l.490–491) that the reported average interaction effort was computed as the absolute value of the interaction torque as a function of time averaged over the entire movement.
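      The computation described here, the absolute value of the interaction torque averaged over the movement, can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `mean_abs_torque`, the trapezoidal integration, and the sample values are not taken from the manuscript.

```python
def mean_abs_torque(times, torques):
    """Time-average of |torque| over a movement, using trapezoidal integration.

    `times` are sample times in seconds (strictly increasing) and `torques`
    the interaction torque at those times. Taking the absolute value before
    averaging prevents positive and negative torques from cancelling out.
    """
    if len(times) != len(torques) or len(times) < 2:
        raise ValueError("need two or more samples, with matching lengths")
    area = 0.0
    for i in range(1, len(times)):
        dt = times[i] - times[i - 1]
        area += 0.5 * (abs(torques[i]) + abs(torques[i - 1])) * dt
    return area / (times[-1] - times[0])


# Hypothetical torque trace (N·m) that changes sign during the movement.
t = [0.0, 0.5, 1.0]
tau = [0.2, -0.2, 0.2]
print(mean_abs_torque(t, tau))
```

      For uniformly sampled data this reduces to approximately the plain mean of the absolute torque samples; the trapezoidal form is only needed when the sampling interval varies.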

      (5) Lines 123-124: "... interaction torque showed no significant correlation with differences in individual vigor within dyads." This statement should be supported by appropriate statistical measures.

      This result is now supported by reporting the corresponding Pearson correlation analyses. No significant correlations were found between interaction torque and differences in individual vigor within dyads (KL conditions: |r| < 0.43, p> 0.22; KH conditions: |r| < 0.18, p > 0.61, see p.5, l.132–133).
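      For readers unfamiliar with the statistic, the Pearson coefficient r reported above can be computed as in the following sketch. The function name `pearson_r` and the data are hypothetical, and the p-value, which requires a t-distribution, is deliberately left out.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two or more paired observations")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    # Raises ZeroDivisionError if either sequence is constant.
    return sxy / math.sqrt(sxx * syy)


# Hypothetical within-dyad vigor differences and mean interaction torques.
delta_vigor = [0.05, 0.10, 0.15, 0.30, 0.60]
torque = [0.21, 0.18, 0.25, 0.19, 0.23]
r = pearson_r(delta_vigor, torque)
print(f"r = {r:.2f}")
```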

      (6) For the analysis, presented in Figure 3C, and specified on lines 116-123, the text mentions the main effects of both condition and target. There doesn’t appear to be much of an effect of the target for the KH data. Should these results not be reported as an interaction effect between the two factors instead?

      We agree with the reviewer and have corrected our presentation of these results (see p.4, l.126–128). Consistent with the reviewer’s observation, no significant effect of the target is found in the KH condition.

      (7) Figures 3E and S6B. What is the purpose of including the averaged data for each pair in addition to both individuals’ data from each pair? It would be useful to distinguish the individual data from the average data for each pair. Frankly, the number of data points shown on this sub-figure is excessive.

      There may have been a misunderstanding. Because the partners of a dyad are connected by a virtual elastic band (rather than a rigid bar), they do not execute identical movements. Therefore, Figs. 3E and S6B display the movement time of all individual participants, together with the corresponding 20 individual regression lines, as in Fig. 2B. The solid black line represents the average across all individuals, and the averaged behaviors of dyads are not included. We have clarified this point by revising the caption of Fig. 3E (see p.5).

      Noted mis-spellings:

      Figure S.3A caption: "trials towards this target."

      Page 10 Line 313: "Importantly, these findings show ...".

      These mis-spellings have been corrected at supplementary p.2 and main text p.11, l.331. Thank you!

      Reviewer #2 (Recommendations for the authors):

      (1) To illustrate the contribution of the three components used to calibrate the overall cost function, it would be informative to include simulation analyses in which each component is selectively removed (i.e., ablation analyses).

      We did not perform ablation analyses, as selectively removing components of the model can lead to instability or ill-suited control inputs, making the resulting simulations difficult to interpret. Instead, we conducted a sensitivity analysis of the key parameters shaping the overall cost function, including the estimated mean and deviation of the slow partner’s movement duration, the weight associated with uncertain torque minimization (Figs. S.18,S.19), and the fast partner’s cost of time (Fig. S20). This analysis reveals the predominant roles of the estimated slow partner movement patterns in determining the model predictions, in agreement with our experimental observations.

      (2) Although the authors refer to the motor-off condition as "passive," participants actively generated the movements in the absence of external forces. Thus, this condition corresponds to active, unassisted movement. A different term may therefore reduce potential confusion for readers.

      We agree that the term “passive” was not well chosen given the context of the paper; we have therefore replaced it with “null-field” condition. Consequently, the P1 and P2 blocks are now referred to as NF1 and NF2.

      (3) Please clarify the instructions given to participants. Were they informed in advance that their movements would physically interact with those of their partner?

      Thank you for pointing out this missing clarification. We have now specified in the Methods (p.14, l.465–469) that participants were not informed prior to any condition that they would interact with a human partner; they were only told that the robot would provide assistance. When debriefed at the end of the experiment, only one out of the 20 participants reported having realized that they were connected to another human. Most participants believed they were interacting either with a version of themselves or with a robot with some randomness.

      (4) Line 475. Should "Fig. 2D" be "Fig. 2B"?

      Thank you for catching this error. The reference has been corrected to Fig. 2B (see p.15, l.522).

      Reviewer #3 (Recommendations for the authors):

      (1) The analysis of reaction times shows no difference between groups in the passive block, which challenges the assumption that movement vigor covaries with decision speed or action initiation speed. It may be worth discussing this in the context of recent literature.

      We agree that the initial analysis and discussion of reaction times were too superficial. In the revised manuscript, we now report that dyadic interaction leads to significantly shorter reaction times (p.4, l.100–109), concomitantly with improved movement velocity. We have also expanded the Discussion on the relationship between decision and action speeds/durations (p.11, l.340–348).

      (2) Many abbreviations are unusual for a non-expert. I would recommend using the full terms instead. At least initially, I found it difficult to follow the results because the abbreviations were not immediately clear (at least to me).

      We agree that the paper had too many abbreviations. We have therefore removed the abbreviated names of the models and, where this did not harm readability, used the full names of the conditions.

      (3) Relatedly, the notation in Figure 1 may be confusing. The labels "S" and "F" (slow and fast) correspond to different concepts than "F" and "L" (follower and leader), so the same participant could be labeled "F" as fast but not "F" as a leader.

      Thank you for pointing out this potential source of confusion. We have therefore modified Fig. 1A (p.2) to avoid any potential confusion by using the full model names rather than abbreviations. In the remainder of the manuscript, "S" and "F" exclusively denote the slower and faster partners within a dyad, and we do not use abbreviations for "leader" or "follower" in the text.

      (4) In figures like 2.C and 3.I, keeping the same scales on the x and y axes and adding a diagonal reference line would make it easier to see shifts across conditions.

      As explained in the Methods, vigor scores in the low- and high-viscosity conditions were computed using the average movement durations from the NF1 condition as a reference. Consequently, because movements are slower in these conditions, the corresponding vigor values are lower than those in NF1. For this reason, using identical scales on the x- and y-axes and adding a 45° reference line could mislead the reader into thinking that the vigor scores are expected to be identical, and would reduce the readability of the figure.

      (5) Multiple hypotheses about dyadic regulation of vigor are nicely explained; it could help to indicate if any of these were a priori favored based on prior literature.

      Previous literature provides mixed evidence regarding how vigor might be regulated in dyadic interaction. For instance, Takagi et al. (2016) [15] reported that mechanically connected partners may rely on independent motor plans, which corresponds to the co-activity hypothesis considered here. However, in that study, movement duration was prescribed. We therefore expected that removing this constraint on movement duration could allow coordination strategies to emerge, particularly in view of findings on haptic communication during tracking of random targets while connected via an elastic band [13,14].

      At the same time, a large body of work on human–human and human–robot interaction has interpreted coordination through a leader–follower framework. In our context, vigor is understood as the outcome of a tradeoff between effort and elapsed time, with time being associated with a decaying reward. Based on this framework, we hypothesized a priori that a leader–follower scheme would emerge, in which the fast partner—being more sensitive to time costs and/or less sensitive to effort—would tend to drive the interaction, even at the expense of increased effort. For these reasons, the leader–follower hypothesis was formulated as the expected outcome throughout the manuscript.

      (6) In the introduction, statements such as "relative vigor of an individual is remarkably stable" appear true only in the solo condition. The same is true in the discussion, where it is said that vigor is a stable trait. The whole study shows that an individual can shift his/her vigor to match that of another individual, so in such conditions vigor appears to me adaptable rather than stable.

      Let us first clarify that when we describe vigor as “remarkably stable”, we do not imply that individuals do not adjust their movement timing in response to changes in external dynamics. For example, movement durations increase in visco-resistive conditions even during solo performance; nevertheless, individuals who move faster in the absence of resistance will remain faster relative to others when resistance is introduced. In this sense, stability refers to the preservation of relative rankings across conditions, rather than invariance of absolute movement timing. Because interaction with another individual constitutes a substantial change in task dynamics, an effect on individual pace is therefore expected.
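      The notion of stability as preserved ranking can be illustrated with a rank correlation. The sketch below is only an illustration under stated assumptions: the movement durations are hypothetical values, not data from the study, and `spearman_rho` is a helper name introduced here that assumes no ties.

```python
import numpy as np

def spearman_rho(a, b):
    """Spearman rank correlation for data without ties."""
    rank_a = np.argsort(np.argsort(a))  # rank of each element of a
    rank_b = np.argsort(np.argsort(b))  # rank of each element of b
    return np.corrcoef(rank_a, rank_b)[0, 1]

# Hypothetical mean movement durations (s) for five individuals.
baseline = np.array([0.62, 0.71, 0.80, 0.95, 1.10])
# Hypothetical resistive condition: everyone slows down, ordering unchanged.
viscous = baseline * 1.4 + 0.05

rho = spearman_rho(baseline, viscous)
```

      Here every individual is slower in the resistive condition, yet the rank correlation stays at its maximum because the ordering across individuals is unchanged.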

      That said, and as the reviewer points out, (i) dyadic interactions lead to the emergence of a dyadic vigor characterized by average movement durations close to those of the fast partners, while the ranking across dyads is largely imposed by the slow partners; and (ii) these adaptations persist after the interaction phase. Importantly, the observed vigor adaptations appear to last longer in our physical interaction task than in previous attempts to manipulate vigor using visual feedback [16]. To account for this adaptability of vigor, we have (i) clarified claims in the Introduction regarding the stability of vigor (see p.1, l.18–20), and (ii) expanded the Discussion to more explicitly address vigor adaptability and the possible resulting consequences for the concept of vigor (see p.12, l.407–412).

      References

      (1) O. Labaune, T. Deroche, C. Teulier, and B. Berret, “Vigor of reaching, walking, and gazing movements: on the consistency of interindividual differences,” Journal of Neurophysiology, vol. 123, pp. 234–242, Jan. 2020.

      (2) L. Rigoux and E. Guigon, “A model of reward- and effort-based optimal decision making and motor control,” PLoS Computational Biology, vol. 8, pp. 1–13, Jan. 2012.

      (3) R. Shadmehr, J. J. O. de Xivry, M. Xu-Wilson, and T.-Y. Shih, “Temporal discounting of reward and the cost of time in motor control,” Journal of Neuroscience, vol. 30, pp. 10507–10516, Aug. 2010.

      (4) B. Berret and G. Baud-Bovy, “Evidence for a cost of time in the invigoration of isometric reaching movements,” Journal of Neurophysiology, vol. 127, pp. 689–701, Feb. 2022.

      (5) D. Verdel, O. Bruneau, G. Sahm, N. Vignais, and B. Berret, “The value of time in the invigoration of human movements when interacting with a robotic exoskeleton,” Science Advances, vol. 9, Sep. 2023.

      (6) K. Jimura, J. Myerson, J. Hilgard, T. S. Braver, and L. Green, “Are people really more patient than other animals? Evidence from human discounting of real liquid rewards,” Psychonomic Bulletin & Review, vol. 16, pp. 1071–1075, Dec. 2009.

      (7) P. L. Gribble, L. I. Mullin, N. Cothros, and A. Mattar, “Role of cocontraction in arm movement accuracy,” Journal of Neurophysiology, vol. 89, pp. 2396–2405, May 2003.

      (8) B. Berret and F. Jean, “Why Don’t We Move Slower? The Value of Time in the Neural Control of Action,” Journal of Neuroscience, vol. 36, pp. 1056–1070, Jan. 2016.

      (9) R. Shadmehr and A. A. Ahmed, Vigor: Neuroeconomics of Movement Control. The MIT Press, 2020.

      (10) D. Thura, A. M. Haith, G. Derosiere, and J. Duque, “The integrated control of decision and movement vigor,” Trends in Cognitive Sciences, vol. 29, pp. 1146–1157, Dec. 2025.

      (11) P. M. Fitts, “The information capacity of the human motor system in controlling the amplitude of movement,” Journal of Experimental Psychology, vol. 47, pp. 381–391, June 1954.

      (12) K. B. Reed and M. A. Peshkin, “Physical collaboration of human-human and human-robot teams,” IEEE Transactions on Haptics, vol. 1, pp. 108–120, July 2008.

      (13) G. Gowrishankar, A. Takagi, R. Osu, T. Yoshioka, M. Kawato, and E. Burdet, “Two is better than one: physical interactions improve motor performance in humans,” Scientific Reports, vol. 4, Jan. 2014.

      (14) A. Takagi, G. Ganesh, T. Yoshioka, M. Kawato, and E. Burdet, “Physically interacting individuals estimate the partner’s goal to enhance their movements,” Nature Human Behaviour, vol. 1, pp. 1–6, Mar. 2017.

      (15) A. Takagi, N. Beckers, and E. Burdet, “Motion plan changes predictably in dyadic reaching,” PLOS ONE, vol. 11, p. e0167314, Dec. 2016.

      (16) P. Mazzoni, B. Shabbott, and J. C. Cortes, “Motor control abnormalities in Parkinson’s disease,” Cold Spring Harbor Perspectives in Medicine, vol. 2, pp. a009282–a009282, Mar. 2012.

    1. eLife Assessment

      This valuable work extends a previously published regression framework for trial-aligned photometry data incorporating functional variables. However, the evidence is generally incomplete, due to the way that within-trial changes in variables have been incorporated into an inherently cross-trial analysis framework, which will limit general adoption. The ideas in this work will be of interest to researchers analyzing photometry signals.

    2. Reviewer #1 (Public review):

      Summary:

      The authors aimed to extend a prior fiber photometry analysis process they developed by incorporating the new ability to determine instantaneous, within trial, relationships between the photometry signal and continuously changing variables. They present solid evidence via simulations and example use cases from published datasets highlighting that their approach can capture instantaneous relationships. Overall, while they make a compelling case that this approach is less biased and more insightful, the implementation for many experimentalists remains challenging enough and may limit widespread adoption by the community.

      Strengths:

      This work builds on prior efforts to analyze photometry signals in a less biased and more statistically sound way. This work incorporates a very important aspect by avoiding the need to summarize individual trials with singular behavioral variables and instead allows for interactions with continuously changing variables to be investigated. The knowledge and expertise of the authors and the presentation provide strong validity and strength to the work. Examples from prior studies in the field are a necessary and important component of the work.

      Weaknesses:

      While use cases are provided from prior data, a clearer presentation of how common implementations in the field are performed (i.e. GLM) and how one could alternatively use the cFLMM approach would help. Otherwise, most may continue using common approaches of Pearson's correlations and GLMs.

    3. Reviewer #2 (Public review):

      The paper presents a regression-based approach for analysing fiber photometry data termed Concurrent Functional Linear Mixed Models (cFLMMs). The approach works by fitting linear mixed effect models separately to each time point in trial aligned data, then applying smoothing to the model coefficients (betas), and computing confidence intervals. The method extends the authors' previous work on using FLMMs for photometry data analysis by allowing for the inclusion of predictors whose value changes across timepoints within a trial, rather than just from trial to trial. As fiber photometry is a rapidly expanding field, developing principled methods to analyse photometry data is valuable, particularly as the authors have released an R package that implements their method to facilitate its use by other groups. The basic FLMM approach for using mixed effects models to analyse trial aligned photometry data, detailed by the authors in their previous manuscript (Loewinger et al. 2025, doi: 10.7554/eLife.95802) appears valuable. The aim of incorporating variables that change within trial into this framework is interesting, and the technical implementation appears to be rigorous. However, I have some reservations as to whether the way in which variables that change within trial have been integrated into the analysis framework is likely to be widely useful, and hence how impactful the additional functionality of cFLMM relative to the previously published FLMM will be.
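      The fit-then-smooth procedure described above can be illustrated with a minimal sketch. It is not the fastFMM implementation: it assumes ordinary least squares in place of mixed-effects models, a moving average in place of the package's smoothing step, omits confidence intervals, and uses simulated data. The function names `pointwise_betas` and `smooth` are hypothetical.

```python
import numpy as np

def pointwise_betas(signal, predictor):
    """Fit one OLS regression per within-trial timepoint.

    signal, predictor: arrays of shape (n_trials, n_timepoints).
    Returns an array of shape (n_timepoints, 2): intercept and slope.
    """
    n_trials, n_timepoints = signal.shape
    betas = np.empty((n_timepoints, 2))
    for s in range(n_timepoints):
        # Design matrix at timepoint s: intercept plus the predictor's value at s.
        X = np.column_stack([np.ones(n_trials), predictor[:, s]])
        betas[s] = np.linalg.lstsq(X, signal[:, s], rcond=None)[0]
    return betas

def smooth(values, window=5):
    """Centered moving average over time (window must be odd)."""
    half = window // 2
    padded = np.pad(values, (half, half), mode="edge")
    return np.convolve(padded, np.ones(window) / window, mode="valid")

# Simulated data (assumption): the true slope varies smoothly within the trial.
rng = np.random.default_rng(0)
n_trials, n_timepoints = 200, 50
true_slope = np.sin(np.linspace(0, np.pi, n_timepoints))
predictor = rng.normal(size=(n_trials, n_timepoints))
signal = true_slope * predictor + 0.5 * rng.normal(size=(n_trials, n_timepoints))

betas = pointwise_betas(signal, predictor)
slope_hat = smooth(betas[:, 1])  # smoothed coefficient time series
```

      In this sketch, each slope estimate at a given timepoint uses only the trial-to-trial variation of the predictor at that timepoint.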

      In the original FLMM approach, where predictors change only from trial-to-trial, fitting separate regressions at each timepoint generates a timeseries of betas for each predictor, indicating when and how the predictor explained variance across the trial. This makes a lot of sense and is widely used in neuroscience data analysis. In extending this approach to incorporate variables that change within trial, the authors have used the same method of fitting separate regression models at each timepoint, to obtain a timeseries of betas for each predictor. It is less clear that this approach makes sense for variables that change within trial. This is because the resulting betas only capture how variation in the predictor across trials at a given timepoint explains variance in the signal, but do not capture effects of variation in the predictor across timepoints within trials. This partitioning of variance in the predictor into a between-trial component whose effect on the signal is modelled, and a within-trial component whose effect on the signal is not, is artificial in many experiment designs, and may yield hard to interpret results.

      Consider e.g. the experimental condition considered in Figure 3, taken from Machen et al. 2025 (doi: 10.1101/2025.03.10.642469) in which mice ran down a linear track to collect rewards. In analysing such data, one might want to know how neural activity covaried with the animal's position, but as this variable changes strongly within trial but will have a similar time-course across trials, the cFLMM analysis approach will not work to quantify these effects. This is because variance attributed to position would not capture how neural activity covaried with changes in the animal's position within trial, but rather how neural activity covaried with changes in the animal's position from trial-to-trial at a given timepoint, which could occur due to e.g. trial-to-trial differences in latency to start moving or running speed. As such, although significant effects of 'position' might be observed, they would not capture covariation between position and neural activity in a straightforwardly interpretable way.

      It is therefore not obvious to me that incorporating variables that change within trial into an analysis framework that runs separate regressions at each timepoint in trial aligned data is likely to be widely useful. If scientific questions require understanding how neural activity covaries as a function of variables that change both within and across trials, an alternative approach would be to run a single regression analysis across all timepoints, and capture the extended temporal responses to discrete behavioural events by using temporal basis functions convolved with the event timeseries. This provides a very flexible framework for capturing covariation of neural activity both with variables that change continuously such as position, and discrete behavioural events such as choices or outcomes, while also handling variable event timing from trial-to-trial.
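      The single-regression alternative described above can be sketched as follows. This is an illustration under stated assumptions, not an analysis from the manuscript: the data are simulated, the temporal basis is the simplest possible one (one unit impulse per lag, i.e., a finite impulse response model), and the helper names `impulse_basis` and `convolved_design` are hypothetical.

```python
import numpy as np

def impulse_basis(n_lags):
    """One unit-impulse basis function per lag (the simplest temporal basis)."""
    return np.eye(n_lags)

def convolved_design(events, basis):
    """Convolve an event train with each basis function.

    events: 1D array over the whole session (1 at event onsets, else 0).
    basis: array of shape (n_basis, n_lags).
    Returns a design matrix of shape (len(events), n_basis).
    """
    # Full convolution truncated to the session length keeps it causal.
    columns = [np.convolve(events, b)[: len(events)] for b in basis]
    return np.column_stack(columns)

# Simulated session (assumption): events at irregular times, exponential response.
rng = np.random.default_rng(1)
n_samples, n_lags = 2000, 10
events = np.zeros(n_samples)
events[rng.choice(n_samples - n_lags, size=60, replace=False)] = 1.0
true_kernel = np.exp(-np.arange(n_lags) / 3.0)
signal = np.convolve(events, true_kernel)[:n_samples] + 0.1 * rng.normal(size=n_samples)

# One regression across all timepoints recovers the event-locked response.
X = convolved_design(events, impulse_basis(n_lags))
kernel_hat = np.linalg.lstsq(X, signal, rcond=None)[0]
```

      Because the design matrix is built from the event times themselves, variable event timing from trial to trial needs no separate alignment step. Smoother bases such as splines would replace `impulse_basis` without changing the rest of the sketch.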

      One way that cFLMM is used in the manuscript is to handle variable timing of trial events in trial aligned data. In the Machen et al. data, the time when the animal reaches the reward varies from trial to trial, and this is represented in the cFLMM analysis by a binary variable which changes value at this timepoint. From the resulting beta coefficient timeseries (Figure 3C) it is not straightforward to understand how neural activity changed as the subject approached and then received the reward. A simpler approach to quantify this, which I think would have yielded more interpretable coefficient timeseries would have been to align activity across trials on when the subject obtained the reward, rather than on the start of the trial, allowing e.g. the effect of reward type to be visualised as a function of time relative to reward delivery, and hence to see the differential effects during approach vs consumption. More broadly, handling variable trial timing in analyses like FLMM which use trial aligned data, can be achieved either by separately aligning the data to different trial events of interest or by time warping the signal to align multiple important timepoints across trials. It is not obvious that using cFLMM with binary indicator variables that indicate when task states changed will yield a clearer picture of neural activity than these methods.
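      Re-aligning trial-start-aligned data to a later event, as suggested above, amounts to re-slicing each trial around its own event sample. The sketch below assumes a toy signal and a hypothetical helper `align_to_event`; samples that fall outside the recorded trial are filled with NaN.

```python
import numpy as np

def align_to_event(signal, event_idx, pre, post):
    """Re-slice each trial around its own event sample.

    signal: (n_trials, n_timepoints), aligned to trial start.
    event_idx: per-trial sample index of the event (e.g., reward receipt).
    Returns (n_trials, pre + post); samples outside the trial are NaN.
    """
    n_trials, n_timepoints = signal.shape
    aligned = np.full((n_trials, pre + post), np.nan)
    for i, e in enumerate(event_idx):
        start, stop = e - pre, e + post
        lo, hi = max(start, 0), min(stop, n_timepoints)
        # Copy the part of the window that lies inside the recorded trial.
        aligned[i, lo - start : hi - start] = signal[i, lo:hi]
    return aligned

# Toy data (assumption): the signal steps from 0 to 1 at a trial-specific time.
event_idx = np.array([20, 35, 50])
signal = np.zeros((3, 60))
for i, e in enumerate(event_idx):
    signal[i, e:] = 1.0

aligned = align_to_event(signal, event_idx, pre=10, post=15)
```

      After alignment, the step occurs at column `pre` in every trial, so a trial-to-trial model fit at each column compares like with like relative to the event.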

      It may be that I am missing some key strengths of cFLMM relative to the other approaches I have outlined, or that there are applications where this approach to implementing within-trial variable changes is a natural formalism. However, my impression is that while cFLMM represents a technical advance, it is not clear how widely useful the model formalism will be.

    4. Reviewer #3 (Public review):

      Summary:

      This work is an extension of their previous study (Loewinger et al 2025) describing a statistical framework for the analysis of photometry data using functional linear mixed models with joint confidence intervals, together with an open-source tool implemented in R. The present study extends it by adding the possibility of using 'concurrent' variables (variables that change within a trial) as regressors, for example, capturing the change of speed at each timepoint in the trial. The main claim is that using 'concurrent' regressors can identify associations between signal and behavior that could not be captured by 'non-concurrent' regressors (the value for a regressor on a specific trial is the same for each timepoint), which could lead to misleading conclusions. While the motivation for using time-varying covariates is useful and supported by previous literature (using fixed-effects models, although not cited in this manuscript), the reanalysis of previous studies does not clearly prove the benefit of using concurrent regressors as opposed to non-concurrent, and some of the results are difficult to interpret.

      Strengths:

      • The motivation for using time-varying covariates is well supported by previous literature using them on fixed-effects models, and here the authors are extending it to mixed-effects models.

      • The authors have included this new functionality in their previous open-source R package.

      Weaknesses:

      • The main weakness of this study is that it is not clear what the conceptual or methodological advance of this work is. As it is written, the manuscript focuses on showing how concurrent regressors offer interpretation advantages over non-concurrent regressors. While the benefit of such time-varying regressors is supported by previous literature (e.g., Engelhard et al., 2020), it is not clear whether the examples provided in the current study clearly support the advantage of one over the other, especially in the reanalysis of Machen et al. (2025), where the choice of regressors is confusing. In this specific example, if the question is about speed and reward type, why are variables such as latency to reward or a binary 'reward zone vs corridor' (RZ) regressor used instead of concurrent velocity (or peak velocity, in the case of the non-concurrent model)? Furthermore, if timing from trial start to reward collection is variable, why not align to reward collection, which would help in the interpretation of the signal and comparison between methods? Finally, while the regressors' coefficients are shown for the non-concurrent method, what seems to be plotted for the concurrent one are contrasts rather than the coefficients. The authors further acknowledge the interpretational difficulties of their analysis.

      • Because the relation between behavioral variables and neuronal signal is not instantaneous, previous literature using fixed effects uses, for example, different temporal lags, splines, and convolutional kernels; however, these are not discussed in the manuscript.

      • From the methods, it seems that in the concurrent version of fastFMM, both concurrent and non-concurrent regressors can be included, but this is not discussed in the manuscript.

      • The methodological advance is not clearly stated, apart from inputting into fastFMM a 3D matrix of regressors x trial x timepoint, instead of a 2D matrix of regressors x trial.

      • This manuscript is neither a clear demonstration of the need for concurrent variables, nor a 'tutorial' of how to use fastFMM with the added extension.

    5. Author response:

      Common responses:

      We thank the editors for considering our paper and the reviewers for their thoughtful and detailed feedback. Based on the comments, we will revise our manuscript to better describe how our approach differs from modeling strategies that are common in the field. We also aim to elaborate on the advantages of fastFMM and what scientific questions it is designed to answer. Finally, we will provide more background on our example analyses and the interpretation of the results.

      Within this response, “within-trial timepoints”, “time-varying predictors/behaviors”, and “signal magnitude” are used as specific examples of the general concepts of “functional domain”, “functional covariates”, and “functional outcome”, respectively. To make statements or examples more concrete, we may use the former neuroscience-specific terms when making general claims about functional models. We also use the following abbreviations:

      - ncFLMM, cFLMM: non-concurrent or concurrent functional linear mixed models.

      - FUI: fast univariate inference. An approximation strategy to perform FLMM (Cui et al., 2022).

      - fastFMM: the R package that implements FUI.

      - CI: confidence interval.

      Before specific line-by-line responses, we provide a brief comparison between cFLMM and fixed effects encoding models. All three reviewers suggested that fixed effects models could be an existing alternative to cFLMM (Reviewer 1 (1B), Reviewer 2 (2C), Reviewer 3 (3A)). Their shared comments highlight that our revision should articulate the advantages and applications of cFLMM relative to existing analysis strategies.

      Functional regression methods like cFLMM produce functional coefficient estimates that quantify how the magnitude of predictor–signal associations evolves across an ordered functional domain such as within-trial timepoints. Standard scalar outcome regression methods, like the GLMs specified in Engelhard et al. (2019), model these associations and their corresponding coefficients as fixed across the functional domain. While GLM encoding models may include time-varying predictors, these analysis strategies do not model the predictor–signal association as changing over the functional domain.

      Moreover, encoding models are less suited to hypothesis testing in clustered or longitudinal settings (e.g., repeated-measures datasets) and yield regression coefficient estimates that are only interpretable with respect to the units of the basis functions. In contrast, cFLMM provides time-varying coefficient estimates that are interpretable as statistical contrasts in terms of the original variables and produces hypothesis tests in clustered settings. cFLMM can be applied to datasets that define covariates in terms of the same flexible representations of covariates used in encoding models; this is a modeling choice rather than a methodological characteristic.

      The remainder of this provisional author response will respond to reviewers’ concerns line-by-line, approximately in the order they appear.

      Reviewer #1 (Public review):

      We thank Reviewer 1 for their comments, especially their efforts to provide first-hand experience with loading and applying fastFMM. We hope that recent improvements to fastFMM’s public release and vignettes address Reviewer 1’s concerns about ease-of-use.

      (1A) Overall, while they make a compelling case that this approach is less biased and more insightful, the implementation for many experimentalists remains challenging enough and may limit widespread adoption by the community.

      We believe the reviewer may have experimented with an old version of fastFMM, so their experience may not reflect recent rewrites and improvements. fastFMM v1.0.0+ is now stable, validated on CRAN, and contains new example data and step-by-step tutorials. We designed fastFMM’s model-fitting code to be similar to common GLM packages in R to reduce the learning curve for new users.

      (1B) …a clearer presentation of how common implementations in the field are performed (i.e. GLM) and how one could alternatively use the cFLMM approach would help.

      We will provide a clearer description of existing methods in the revised manuscript. Briefly, inference with fastFMM can accommodate large datasets that contain clustered data, repeated measures, or complex hierarchical effects, e.g., experiments with multiple animals and multiple trials per animal. When encoding models are fit to each cluster (e.g., animal, neuron) separately, we are not aware of a principled method to pool these cluster-specific models together to quantify uncertainty or yield an appropriate global hypothesis test.

      Reviewer #2 (Public review):

      Reviewer 2’s thoughtful feedback helped structure our points in the common response above, which we will refer to when applicable. In our response, we aim to clarify the problems that cFLMM solves and characterize the advantages in interpretability.

      (2A) The aim of incorporating variables that change within trial into this framework is interesting, and the technical implementation appears to be rigorous. However, I have some reservations as to whether the way in which variables that change within trial have been integrated into the analysis framework is likely to be widely useful, and hence how impactful the additional functionality of cFLMM relative to the previously published FLMM will be.

      We hope that the common response addresses these concerns. We were motivated to provide a concurrent extension of fastFMM based on our experience with statistical consulting in neuroscience research. Questions that benefit from a functional approach are common and often not adequately modeled with a non-concurrent approach, such as the variable trial length analysis we describe below.

      (2B) It is less clear that this approach makes sense for variables that change within trial…This partitioning of variance in the predictor into a between-trial component whose effect on the signal is modeled, and a within-trial component whose effect on the signal is not, is artificial in many experiment designs, and may yield hard to interpret results.

      We thank Reviewer 2 for highlighting a point that we did not adequately explain and that we will address further in the revision. The pointwise and joint CIs estimated by fastFMM account for uncertainty in the coefficient estimates due to variation in the predictors across within-trial timepoints. cFLMM targets a statistical quantity, or estimand, that is defined by timepoint-specific effects, so the first step of our estimation strategy fits separate pointwise mixed models. However, models from every within-trial timepoint are then combined to calculate uncertainty and smooth the coefficient estimates. Thus, the widths of the pointwise and joint CIs depend on the estimated between-timepoint covariance and a smoothing penalty. Loewinger et al. (2025a) provides further details in Appendices 2 and 3, describing the covariance structure and detailing the power improvements of FUI compared to multiple-comparisons corrections.

      Other functional regression estimation strategies jointly fit the entire model with a single regression, e.g., functional generalized estimating equations (Loewinger et al., 2025b). However, these methods use basis expansions of the coefficients. In contrast, the encoding models mentioned in 2C below and by Reviewer 3 (3A) apply basis expansions of the covariates, and the resulting model does not capture how signal–covariate associations evolve across some functional domain. Although the first stage in the fastFMM approach fits pointwise linear models, this is only one of three steps in the estimation strategy. fastFMM yields coefficient estimates comparable to those that would be obtained from functional regression estimation strategies that jointly estimate the functional coefficients in a single regression. We mention this to distinguish between the target statistical quantity (functional coefficients) and the estimation strategy (pointwise vs. joint).

      (2C) …an alternative approach would be to run a single regression analysis across all timepoints, and capture the extended temporal responses to discrete behavioural events by using temporal basis functions convolved with the event timeseries. This provides a very flexible framework for capturing covariation of neural activity both with variables that change continuously such as position, and discrete behavioural events such as choices or outcomes, while also handling variable event timing from trial-to-trial.

      Our understanding is that the suggested approach aims to quantify the association between the outcome and within-trial patterns in covariates. This is a great question and we will incorporate a discussion of this into the revision. However, temporal basis functions convolved with the covariate time series cannot directly characterize these relationships. Encoding models can detect the contribution of predictors to neural signals while remaining agnostic to the precise relationship, but this flexibility can come at the cost of interpretability. The coefficients of the convolutions may not be translatable into a clear statistical contrast in terms of the original covariates.

      In our paper, we provide examples of cFLMM models with simple signal-covariate relationships. The coefficient estimates quantify the expected change in signal given a one unit change in the original predictors. Let 𝑌(𝑠) be the outcome and 𝑋(𝑠) be some covariate at within-trial timepoint 𝑠. For brevity, we will suppress subject/trial indices and random effects in the following notation. The coefficient at time point 𝑠 can be captured by the generic mean model

      𝔼[𝑌(𝑠) ∣ 𝑋(𝑠) = 1] − 𝔼[𝑌(𝑠) ∣ 𝑋(𝑠) = 0].

      In contrast, the change in signal associated with patterns in within-trial covariates can be written as

      𝔼[𝑌(𝑠₁) ∣ 𝑋(𝑠₂) = 1] − 𝔼[𝑌(𝑠₁) ∣ 𝑋(𝑠₂) = 0]

      for all pairs of timepoints 𝑠₁, 𝑠₂. While simple lagged or offset outcome–predictor associations can be incorporated as covariates in cFLMM, the approach does not capture associations across all pairs of within-trial timepoints 𝑠₁, 𝑠₂. Encoding models also do not target the above estimand. Instead, a full function-on-function regression could estimate it. This topic can be incorporated into our revision and may be a future line of inquiry.
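      As a hedged illustration of the distinction drawn above, the two targets can be written under an assumed linear mean model. The coefficient symbols and the discrete sum over timepoints are notational assumptions introduced here, not taken from the manuscript; subject/trial indices and random effects are suppressed as above.

```latex
% Concurrent model (assumed linear form): the coefficient at s involves only X(s).
\mathbb{E}[Y(s) \mid X(s)] = \beta_0(s) + \beta_1(s)\, X(s)
\quad\Longrightarrow\quad
\beta_1(s) = \mathbb{E}[Y(s) \mid X(s) = 1] - \mathbb{E}[Y(s) \mid X(s) = 0].

% Function-on-function model (assumed discrete form): the signal at s_1 may depend on X at every s_2.
\mathbb{E}[Y(s_1) \mid X(\cdot)] = \beta_0(s_1) + \sum_{s_2} \beta_1(s_1, s_2)\, X(s_2).
```

      Under these assumptions, the concurrent model corresponds to restricting the two-index coefficient to zero whenever the two timepoints differ, which is one way to see why it cannot describe associations across all pairs of timepoints.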

      (2D) In the Machen et al. data…From the resulting beta coefficient timeseries (Figure 3C) it is not straightforward to understand how neural activity changed as the subject approached and then received the reward. A simpler approach to quantify this, which I think would have yielded more interpretable coefficient timeseries would have been to align activity across trials on when the subject obtained the reward. More broadly, handling variable trial timing in analyses like FLMM which use trial aligned data, can be achieved either by separately aligning the data to different trial events of interest or by time warping the signal to align multiple important timepoints across trials.

      In this experiment, mice waited in a trigger zone, ran through a linear corridor, then received a reward of either water or strawberry milkshake in the reward delivery zone (Machen et al., 2026). Mice received different rewards between sessions but the same reward within all trials of a given session. This design complicated the analysis, as the reward type produced prominent differences in average latency (water: 3.3 seconds, milkshake: 2.0 seconds). The authors wanted to disentangle whether mean differences in the signal across reward types reflected differences in motivation to obtain the reward or differences in reaction to reward receipt.

      We agree that performing a reward-aligned analysis would be an intuitive approach to visualize the differences in average signal for mice that received milkshake compared to water. In fact, we provide an ncFLMM reward-aligned analysis in Figure S1 of Machen et al. (2025). We will add this analysis to the revision and thank the reviewer for the suggestion. We emphasize, however, that this method answers a different question. It does not identify how the signal change associated with receiving the milkshake evolves with respect to latency, especially if the relationship is non-linear. Time warping faces similar obstacles in this setting, especially since sufficiently flexible curve registration can induce similarity due purely to noise. Generally, time warping does not lend itself to hypothesis testing, as it is unclear how to propagate uncertainty from the time warping model into final hypothesis tests.

      We believe cFLMM is an appropriate choice for the specific question, and we will revise the manuscript to better reflect its advantages. The functional coefficient estimates in Figures 3C-iii and 3C-iv provide insights that cannot be derived from the proposed alternatives. For example, we can infer that for short latencies, there is no significant difference in signal magnitude between mice receiving water and mice receiving the milkshake. However, for latencies longer than around 2 seconds, receiving the milkshake is associated with an additional positive change in signal. We agree that we should make Figure 3C and the accompanying discussion clearer and thank Reviewer 2 for their feedback on interpretation.

      Reviewer 3 (Public review):

      (3A) …it is not clear what the conceptual or methodological advance of this work is. As it is written, the manuscript focuses on showing how concurrent regressors offer interpretation advantages over non-concurrent regressors. While the benefit of such time-varying regressors is supported by previous literature (e.g., Engelhard et al., 2020), it is not clear whether the examples provided in the current study clearly support the advantage of one over the other…

      We assume Reviewer 3 is referencing “Specialized coding of sensory, motor and cognitive variables in VTA dopamine neurons” (Engelhard et al., 2019). We hope that the Common response sufficiently contrasts the settings where each approach can be applied. Because these models have different goals and assumptions, they are appropriate for answering different questions.

      (3B) In this specific example, if the question is about speed and reward type, why variables such as latency to reward or a binary “reward zone vs corridor” (RZ) regressors are used instead of concurrent velocity (or peak velocity - in the case of the non-concurrent model)? Furthermore, if timing from trial start to reward collection is variable, why not align to reward collection, which would help in the interpretation of the signal and comparison between methods? Furthermore, while for the non-concurrent method, the regressors' coefficients are shown, for the concurrent one, what seems to be plotted are contrasts rather than the coefficients. The authors further acknowledge the interpretational difficulties of their analysis.

      Thank you for pointing out that we were not clear. This was mentioned by multiple reviewers and highlights the need to elaborate on our motivation in the revision. In this example, we wanted to investigate the change in signal-reward association as a function of within-trial timepoints, not the association between instantaneous velocity and the signal. “Slow” or “fast” means “mouse with below or above average latency”. Please refer to our response to Reviewer 2 (2D), where we discuss why event alignment is an insufficient correction.

      The functional coefficient estimates in Figure 3C are interpreted as contrasts because the fixed effect coefficients capture the difference in expected signal between strawberry milkshake and water along the functional domain. An advantage of cFLMM is that it is easy to specify models in which the coefficients correspond to interpretable contrasts of the signal across conditions. The coefficient estimate shown in Figure 3B-ii also corresponds to a contrast because the estimates capture the difference in mean signal from strawberry milkshake and water. Equations (7) and (8) in the section “Materials and methods” and sub-section “Variable trial length analysis” provide additional details on the fixed effect coefficients. Based on this confusion, we will convert the two 1 x 4 sub-plots of 3B and 3C into two 2 x 2 sub-plots to avoid unintended direct comparisons.

      To contextualize how we “acknowledge the interpretational difficulties of [our] analysis”, we stated that a non-concurrent FLMM attempting to control for a time-based covariate is difficult to interpret. The concurrent FLMM provides a straightforward interpretation directly related to the question of interest, which we discuss above in Reviewer 2 (2D).

      (3C) Because the relation between behavioral variables and neuronal signal is not instantaneous, previous literature using fixed effects uses, for example, different temporal lags, splines, and convolutional kernels; however, these are not discussed in the manuscript.

      Thank you for this suggestion. All three reviewers raised this topic (see Reviewer 1 (1B), Reviewer 2 (2C), and the Common responses), and we will incorporate our response in the revision.

      (3D) From the methods, it seems that in the concurrent version of fastFMM, both concurrent and non-concurrent regressors can be included, but this is not discussed in the manuscript.

      This is an important point that we mentioned implicitly. In our cFLMM specification of the Jeong et al. (2022) model, “we incorporated trial-specific covariates for trial number and session, modeling these as increasing numerical values rather than identical categorical variables”, which are also plotted in Appendix 3. In Box 1, “if the functional covariate of interest is a scalar constant across the domain, the models fit by the concurrent and non-concurrent procedure are identical”. We will explicitly point out that cFLMM can perform inference on combinations of functional and constant covariates.
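
      The equivalence quoted from Box 1 can be checked numerically in a simplified setting. The sketch below is not the fastFMM implementation: it assumes pointwise ordinary least squares with no random effects, no smoothing, and no confidence intervals, and all names (`pointwise_ols`, `reward`, `speed`) are hypothetical. It stores covariates as a covariate x trial x timepoint array and shows that repeating a trial-level covariate across timepoints reproduces the non-concurrent fit, and that constant and time-varying covariates can be combined.

```python
import numpy as np

rng = np.random.default_rng(2)
n_trials, n_time = 200, 50

reward = rng.integers(0, 2, size=n_trials).astype(float)   # constant within trial
speed = rng.normal(0, 1, size=(n_trials, n_time))          # varies within trial
Y = 0.8 * reward[:, None] + 0.3 * speed + rng.normal(0, 0.5, (n_trials, n_time))


def pointwise_ols(y, covariates):
    """Fit a separate OLS at each timepoint.

    covariates has shape (n_covariates, n_trials, n_time); the result has
    shape (n_time, n_covariates + 1) with the intercept in column 0.
    """
    n_cov, n_tr, n_t = covariates.shape
    out = np.empty((n_t, n_cov + 1))
    for s in range(n_t):
        design = np.column_stack([np.ones(n_tr), covariates[:, :, s].T])
        out[s] = np.linalg.lstsq(design, y[:, s], rcond=None)[0]
    return out


# Non-concurrent layout: one design matrix (trials x covariates) for all timepoints.
design_2d = np.column_stack([np.ones(n_trials), reward])
nonconcurrent_fit = np.linalg.lstsq(design_2d, Y, rcond=None)[0]   # (2, n_time)

# Concurrent layout: the same scalar repeated across the functional domain.
reward_3d = np.broadcast_to(reward[:, None], (n_trials, n_time))
concurrent_fit = pointwise_ols(Y, reward_3d[None, :, :])           # (n_time, 2)

# Mixed model: one constant covariate and one time-varying covariate together.
mixed_fit = pointwise_ols(Y, np.stack([reward_3d, speed]))         # (n_time, 3)
```

      Under these assumptions, `nonconcurrent_fit` and the transpose of `concurrent_fit` agree up to floating-point error, while `mixed_fit` carries one extra column for the time-varying covariate.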

      (3E) The methodological advance is not clearly stated, apart from inputting into fastFMM a 3D matrix of regressors x trial x timepoint, instead of a 2D matrix of regressors x trial.

      Prior to our work described in this Research Advance, it was not obvious that the existing approximation approach in fastFMM could be generalized to cFLMM. During the writing of the article, a fastFMM user reached out for help with producing pseudo-concurrent FLMMs by duplicating rows in a non-concurrent model, which underscores both the unmet need for cFLMMs and the difficulty of fitting them with available tools.

      The “under-the-hood” differences are described in Appendix 4. Concurrent FLMM with fast univariate inference was theoretically possible as early as Cui et al. (2022). The univariate step was straightforward, but guaranteeing “fast” and “inference” was not. We needed to verify, for example, that the method-of-moments estimation of the random effects covariance matrix generalized to cFLMM, which is not a trivial step. Characterizing whether the method achieved asymptotic coverage required extensive simulation studies (Figure 4, Appendix 2). Future work may focus on fully characterizing the asymptotic convergence in high noise or high complexity regimes.

      (3F) This manuscript is neither a clear demonstration of the need for concurrent variables, nor a 'tutorial' of how to use fastFMM with the added extension.

      We hope that the Common responses clarify how cFLMM compares to existing approaches and fills a gap in the data analysis landscape for neuroscience. The fastFMM R package vignettes contain example analyses, and we intend for these files to work in tandem with the manuscript. To provide more guidance for interested analysts, we can explicitly reference these tutorials within the revision.

      Planned revisions

      The following summary is not exhaustive.

      Writing additions:

      Per 1B, 2C and 3A, the Common responses will be incorporated in the revision.

      Per 2B, we will discuss function-on-function regression and explore how to estimate statistical contrasts for complex within-trial relationships. Relatedly, we will clarify that the CIs in fastFMM are constructed using an estimate of the within-trial covariance of the predictors, and clarify the definition of pointwise and joint CIs.

      Per 3D, we will explicitly state that concurrent FLMMs can include covariates that are constant over within-trial timepoints.

      Though we cannot prescribe a universally correct model selection procedure, we will mention that AIC, BIC, and other summary statistics can inform the specification of the random effects.

      Analysis modifications:

      Parts of Appendix 3 may be included in Figure 2 to directly address the question investigated by Jeong et al. (2022) and Loewinger et al. (2024).

      When discussing Machen et al. (2025) data, the supplementary analysis with reward-aligned ncFLMM models might be added to clarify the ncFLMM/cFLMM difference.

      Per Reviewer 2's comments on encoding models, the additional analysis aimed at disentangling latency and reward in Machen et al.’s variable trial length data may be incorporated as an additional sub-figure in Figure 3.

      Aesthetic changes:

      Figure 3 will be reorganized to avoid unintended direct comparisons between the coefficients of the non-concurrent and concurrent model.

      Citations for Machen et al. (2026) will be updated to reflect publication of the preprint.

      The version number for fastFMM will be updated.

      References

      Cui E, Leroux A, Smirnova E, Crainiceanu CM. Fast Univariate Inference for Longitudinal Functional Models. Journal of Computational and Graphical Statistics. 2022; 31(1):219–230. https://doi.org/10.1080/10618600.2021.1950006, doi: 10.1080/10618600.2021.1950006, PMID: 35712524.

      Engelhard B, Finkelstein J, Cox J, Fleming W, Jang HJ, Ornelas S, Koay SA, Thiberge SY, Daw ND, Tank DW, Witten IB. Specialized coding of sensory, motor and cognitive variables in VTA dopamine neurons. Nature. 2019 Jun; 570(7762):509–513. https://www.nature.com/articles/s41586-019-1261-9, doi: 10.1038/s41586-019-1261-9.

      Jeong H, Taylor A, Floeder JR, Lohmann M, Mihalas S, Wu B, Zhou M, Burke DA, Namboodiri VMK. Mesolimbic dopamine release conveys causal associations. Science. 2022; 378(6626):eabq6740. https://www.science.org/doi/abs/10.1126/science.abq6740, doi: 10.1126/science.abq6740.

      Loewinger G, Cui E, Lovinger D, Pereira F. A statistical framework for analysis of trial-level temporal dynamics in fiber photometry experiments. eLife. 2025 Mar; 13:RP95802. doi: 10.7554/eLife.95802.

      Loewinger G, Levis AW, Cui E, Pereira F. Fast Penalized Generalized Estimating Equations for Large Longitudinal Functional Datasets. ArXiv. 2025 Jun; p. arXiv:2506.20437v1. https://pmc.ncbi.nlm.nih.gov/articles/PMC12306803/.

      Machen B, Miller SN, Xin A, Lampert C, Assaf L, Tucker J, Herrell S, Pereira F, Loewinger G, Beas S. The encoding of interoceptive-based predictions by the paraventricular nucleus of the thalamus D2R+ neurons. iScience. 2026 Jan; 29(1):114390. doi: 10.1016/j.isci.2025.114390.

    1. eLife Assessment

      This manuscript explores the dynamic behaviors of Pol II and Pol III puncta that encompass the SL1 and 5S genes, following up on the authors' prior studies on ATTF-6. The authors show that ATTF-6 is required for RNA Pol II but not RNA Pol III foci, demonstrating that within the gene cluster, the regulation of RNA Pol II and RNA Pol III remain distinct from each other. The study is useful for analyzing understudied gene families, but it is incomplete and needs additional edits and experiments.

    2. Reviewer #1 (Public review):

      This study examines how two types of RNA polymerases organize themselves within the nucleus of C. elegans cells, building directly on the same group's prior publication and largely functioning as a companion to that earlier work. While the observation that the two polymerases occupy distinct but neighboring locations at the same genomic region adds nuance to our understanding of gene cluster regulation, the manuscript would benefit from more clearly delineating which findings are new versus continuations of previously published work. Protein localization, gene expression effects, and genomic mapping data appear to overlap substantially with the earlier paper.

      The condensate claims would also benefit from additional experimental support. Demonstrating fusion events and concentration-dependent assembly are now standard expectations in the field. Additionally, one measurement reported appears inconsistent with a condensate model, warranting further discussion.

      Some findings would benefit from more interpretive context. Why does polymerase clustering fluctuate with the cell cycle? What are the functional implications of ATTF-6 being required for one polymerase's foci but not the other's?

      The elevated-temperature experiments are intriguing but difficult to interpret, as the temperature used is well-established as a broad stress trigger in this organism. Acknowledging this and considering additional controls would help clarify whether the observed effects are specific to foci behavior.

      Finally, the manuscript would be strengthened by adding quantification to some figures and revising the model diagram to better reflect what the current data support.

    3. Reviewer #2 (Public review):

      Summary:

      The researchers analyzed GFP-tagged RNA Pol II and RNA Pol III catalytic subunits RPB-1 and RPC-1, and showed that they form foci in early embryo nuclei that overlap with the 5S rDNA loci and with foci formed by ATTF-6-RFP. They showed that the foci are round, dissolve upon hexanediol incubation, are detected during S phase, disappear during mitosis, and are re-established afterwards. The researchers performed FRAP and showed fast exchange of the polymerases, unlike ATTF-6. They show that, unlike RNA Pol III foci, RNA Pol II foci are dependent on ATTF-6 and temperature sensitive. The researchers propose that the two polymerases form distinct foci with different biochemical dependencies. This study shows that, although closely located within a gene cluster, the regulation of RNA Pol II and RNA Pol III is independent.

      Strengths:

      The researchers provide high-quality images that support the main results. The auxin-inducible and RNAi depletions are validated in the same embryos by fluorescence analysis of the target protein.

      Weaknesses:

      Although the researchers propose the hypothesis that the RNA Pol II and RNA Pol III form distinct condensates, alternative hypotheses are not presented, and the criteria by which the other possibilities are ruled out are not discussed.

    4. Reviewer #3 (Public review):

      Wang et al demonstrate that RNA polymerase II and RNA polymerase III form distinct nuclear foci at the 5S rDNA-SL1 gene cluster in C. elegans. By ChIP, Pol II is highly enriched at the SL1 gene, whereas Pol III is enriched at the 5S rRNA gene. Both polymerase foci are spherical, show rapid exchange in FRAP experiments, and assemble in a cell-cycle-dependent manner, predominantly during S phase. The transcription factors ATTF-6 and SNPC-4 are required for the formation of Pol II foci but are dispensable for Pol III foci. Pol II foci, but not Pol III foci, are temperature-sensitive and dissolve upon heat stress; dissolution correlates with a strong reduction of SL1 transcription, whereas 5S rRNA levels remain largely unaffected.

      Overall, this is a clean, well-organized, and well-controlled study, and I only have two comments.

      (1) Roundness measurements, FRAP, and sensitivity to 1,6-hexanediol are indicative but not sufficient to show that these foci are condensates. They could, for example, also be scaffolded/chromatin-anchored assemblies (see https://pubmed.ncbi.nlm.nih.gov/36526633/). Please either provide better evidence or rephrase/tone down the condensate statements.

      (2) Image quantification is only provided for Figure 5, but should also be reported for Figures 6 and 7. In addition to the number of foci, other measures, such as intensity over background (similar to a partition coefficient), should be quantified.
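
      One simple way to express intensity over background is the ratio of the mean intensity inside a focus to the mean intensity of the remaining nucleoplasm. The sketch below is a minimal illustration on a synthetic image; the function name `enrichment_ratio`, the boolean-mask inputs, and the toy intensities are assumptions, and real data would additionally need segmentation and camera-offset subtraction, which are not shown.

```python
import numpy as np


def enrichment_ratio(image, focus_mask, nucleus_mask):
    """Mean focus intensity divided by mean intensity of the rest of the nucleus."""
    background_mask = nucleus_mask & ~focus_mask
    if not focus_mask.any() or not background_mask.any():
        raise ValueError("both focus and background regions must be non-empty")
    return image[focus_mask].mean() / image[background_mask].mean()


# Toy nucleus: uniform background of 100 with a 3 x 3 focus of intensity 400.
image = np.full((20, 20), 100.0)
image[8:11, 8:11] = 400.0
nucleus_mask = np.ones((20, 20), dtype=bool)
focus_mask = np.zeros((20, 20), dtype=bool)
focus_mask[8:11, 8:11] = True

ratio = enrichment_ratio(image, focus_mask, nucleus_mask)  # 4.0 by construction
```

      Reporting such a ratio per focus, alongside the number of foci, would allow conditions to be compared on a common scale.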

    5. Author response:

      Reviewer #1:

      We appreciate the reviewer’s suggestions. In the revision, we will clarify which results are new and better position this work relative to our earlier publication. We will also expand the discussion of the functional implications of polymerase clustering and its cell-cycle dynamics.

      Regarding the condensate interpretation, we agree that the current evidence is suggestive but not definitive. In the revised manuscript, we will clarify how our measurements relate to commonly used criteria for condensate assemblies and revise the text to avoid overstating this interpretation. We will also add quantification to additional figures and revise the model diagram to more accurately reflect the conclusions supported by the data.

      Reviewer #2:

      We thank the reviewer for the positive assessment of the imaging quality. We agree that the manuscript would benefit from a broader discussion of possible models for the observed polymerase foci. In the revision, we will expand the discussion to include alternative interpretations, such as scaffolded assemblies, as suggested by Reviewer #3, and further clarify the properties of the RNA Pol II and RNA Pol III foci.

      Reviewer #3:

      We thank the reviewer for the positive evaluation of the study and the helpful suggestions. We agree that the current evidence is indicative but not sufficient to definitively demonstrate condensate formation. In the revision, we will revise the language and discuss alternative interpretations, including scaffolded assemblies. We will also provide additional quantifications for the relevant figures.

      Overall, we appreciate the reviewers’ suggestions and believe that the planned revisions will improve the clarity and impact of the manuscript.

    1. eLife Assessment

      This fundamental work uncovers an unexpected lysosomal function for NINJ2 and links it to ferroptosis and cancer biology. The evidence supporting the conclusions appears to be convincing. Additional mechanistic clarification, particularly around the NINJ2-LAMP1 interaction and ferroptosis specificity, will further strengthen the manuscript. This work will be of general interest to the community of ferroptosis and cancer biology.

    2. Reviewer #1 (Public review):

      Summary:

      This study reports a novel and potentially impactful role for NINJ2 in maintaining lysosomal integrity and regulating cellular susceptibility to ferroptosis. The authors demonstrate that NINJ2 localizes to lysosomes and interacts with LAMP1, a key lysosomal membrane glycoprotein involved in sensing lysosomal stress. Loss of NINJ2 increases lysosomal membrane permeabilization (LMP), resulting in selective leakage of lysosomal contents, including labile iron, into the cytosol. The authors further show that NINJ2 deficiency reduces the expression of ferritin storage proteins, thereby sensitizing cells to ferroptosis induced by RSL3 and erastin. Collectively, the work proposes a mechanistic link between NINJ2-mediated control of LMP, iron homeostasis, and ferroptotic vulnerability, with potential relevance to cancer biology.

      Strengths:

      This study identifies a novel role for NINJ2 in regulating lysosomal integrity and ferroptosis and establishes a mechanistic link between lysosomal membrane permeabilization, iron homeostasis, and ferroptotic sensitivity, with potential translational relevance in cancer.

      Weaknesses:

      The results overall support the authors' conclusions and provide a plausible mechanistic framework; however, additional quantification of Western blot data and further discussion of mechanistic questions would strengthen the study.

      The findings are likely to have a broad impact by linking lysosomal integrity to ferroptosis and iron homeostasis, both of which are relevant to cancer biology and therapeutic targeting.

    3. Reviewer #2 (Public review):

      This manuscript, "Nerve Injury-Induced Protein 2 preserves lysosomal membrane integrity to suppress ferroptosis", identifies a previously unrecognized function of NINJ2 as a regulator of lysosomal membrane integrity and iron homeostasis, thereby suppressing ferroptosis. The authors demonstrate that NINJ2 localizes to lysosomes, interacts with LAMP1, limits lysosomal membrane permeabilization (LMP), stabilizes ferritin, and protects cells from ferroptotic cell death. They further extend these mechanistic findings to human cancer datasets, showing co-overexpression and positive correlation of NINJ2 with ferritin genes in iron-addicted cancers.

      Overall, the study is conceptually interesting, technically solid, and integrates cell biology, iron metabolism, and ferroptosis in a coherent framework. The work expands the functional repertoire of the Ninjurin family beyond plasma membrane rupture and inflammation, which will be of interest to researchers in cell death, lysosome biology, and cancer metabolism.

      Strengths:

      (1) The identification of NINJ2 as a lysosome-associated protein that suppresses ferroptosis represents a meaningful advance beyond its previously described roles in inflammation, pyroptosis, and tumorigenesis.

      (2) The work distinguishes NINJ2 functionally from NINJ1, reinforcing the idea that structurally related Ninjurins have divergent membrane-related roles.

      (3) The study presents a logically connected pathway:<br /> NINJ2 loss → LMP → labile iron increase → ferritin degradation → ferroptosis sensitization, which is well supported by the data.

      (4) The link between LAMP1, ferritin turnover, and ferroptosis is particularly compelling and timely given recent interest in lysosomal contributions to ferroptotic signaling.

      (5) The authors use confocal microscopy, proximity ligation assays, biochemical IPs, iron measurements, protein half-life analyses, ferroptosis assays, and TCGA-based analyses, providing convergent evidence for their model.

      (6) Use of two distinct cell lines (MCF7 and Molt4) strengthens generalizability.

      (7) The integration of cancer expression datasets linking NINJ2 with ferritin expression in hepatocellular and breast carcinomas enhances translational relevance.

      (8) Assigning NINJ2 a lysosomal protective function, distinct from NINJ1-mediated plasma membrane rupture, is novel.

      (9) Linking NINJ2 to ferroptosis regulation via lysosomal iron handling, rather than canonical GPX4 or system Xc⁻ pathways, is also novel, along with proposing a NINJ2-LAMP1-ferritin axis as a buffering mechanism against iron-driven lipid peroxidation.

      (10) These insights are not incremental; they reframe how NINJ2 may function at the intersection of membrane biology, iron metabolism, and regulated cell death.

      Areas for improvement:

      While the study is strong, several issues should be addressed for mechanistic depth and general relevance.

      (1) Although NINJ2 is shown to interact with LAMP1 and LAMP1 knockdown rescues ferritin levels, it remains unclear whether the NINJ2-LAMP1 interaction is required for lysosomal protection. The authors could:<br /> a) Map the NINJ2 domain required for LAMP1 interaction and test whether an interaction-deficient mutant fails to protect against LMP and ferroptosis.<br /> b) Rescue NINJ2 KO cells with wild-type versus mutant NINJ2 to establish causality.

      (2) The conclusion that NINJ2 suppresses ferroptosis relies primarily on RSL3 and Erastin sensitivity. A direct assessment of ferroptosis would enhance the study. The authors could:<br /> a) Include ferroptosis rescue experiments using ferrostatin 1 or liproxstatin 1.<br /> b) Assess lipid peroxidation directly (e.g., C11 BODIPY staining) to strengthen the ferroptosis claim.

      (3) The manuscript discusses lysosomal ferritin degradation but does not directly examine NCOA4, a central mediator of ferritinophagy. It would be good to:<br /> a) Test whether NCOA4 knockdown rescues ferritin loss and ferroptosis sensitivity in NINJ2 KO cells.<br /> b) This would clarify whether NINJ2 acts upstream of canonical ferritinophagy pathways or via an alternative mechanism.

      (4) The study is entirely cell-based, despite references to inflammatory and tumor phenotypes in Ninj2-deficient mice. While not strictly required, even limited in vivo validation (e.g., ferroptosis markers or iron accumulation in existing Ninj2 KO tissues) would substantially strengthen the manuscript.

      (5) Finally, most imaging data (e.g., Galectin 3/LAMP1 colocalization, PLA signals) and immunoblot data are presented qualitatively. The authors should provide quantification of the Western blots and other measurements.

    4. Author response:

      Reviewer #1:

      We appreciate the reviewer’s insightful suggestions. In the revised manuscript, we will provide quantitative analysis of Western blot data throughout the study to improve data robustness and reproducibility. In addition, we will expand the “Discussion” section to address the following points raised by Reviewer #1: (1) Potential mechanisms underlying the regulation of LAMP1 transcript levels by NINJ2; (2) Whether Ninjurin1 may play a similar role in regulating lysosomal membrane permeabilization (LMP); (3) The potential clinical implications of our findings, particularly in relation to cancer progression and therapeutic targeting.

      Reviewer #2:

      We thank the reviewer for the insightful and constructive suggestions, which would further deepen the mechanistic understanding of the NINJ2-LAMP1 pathway and its role in ferroptosis regulation. To address the reviewer’s concerns, we will clarify the interpretation of our findings, add quantitative analyses where appropriate, and expand the Discussion to acknowledge these important mechanistic questions and future research directions. Specifically, we will revise the Statistical Analysis section to clearly describe the statistical methods used, including whether corrections for multiple comparisons were applied where appropriate. We will further discuss the potential interaction domain(s) between NINJ2 and LAMP1. We will also discuss the potential role of NCOA4, a central mediator of ferritinophagy, in the NINJ2-FTH1-LAMP1 pathway. Finally, we will include a schematic model summarizing the proposed NINJ2-LAMP1-iron-ferroptosis axis to better illustrate the working model of our study.

    1. eLife Assessment

      This important study addresses the long-debated hypothesis that humans preferentially choose partners with dissimilar immune genes, using data from a small-scale society that allows comparison between arranged and self-chosen partnerships. Across multiple analyses controlling for genome-wide relatedness and examining functional immune diversity, the authors find no evidence of HLA/MHC-based (dis)assortative mating, suggesting that immune gene variation has limited influence on mate choice in this relatively homogeneous population and that the observed patterns instead reflect selection acting directly on immune loci. While the strength of the evidence is compelling for this population, several conclusions rely on indirect reconstruction methods and imputed data for a very complex region of the genome, which may limit how firmly some claims can be supported.

    2. Reviewer #1 (Public review):

      Summary:

      This study aims to test whether human mate choice is influenced by HLA similarity while accounting for genome-wide relatedness, using the Himba as an evolutionarily relevant small-scale society population, unique among most HLA-mate choice studies. By comparing self-chosen ("love") and arranged marriages and using NGS-based 8-locus HLA class I and II sequences and genome-wide SNP data, the authors ask whether partners who freely choose each other are more HLA-dissimilar than those paired through social arrangements or random pairs. They further extend their work by examining functional differences in peptide-binding divergence among pairs and predicted pathogen recognition in potential offspring.

      Strengths:

      This study has many strengths. The most obvious is their ability to test for HLA-based mate choice in the Himba, a non-European, non-admixed, small-scale society population, the type of population that has been missing, in my opinion, from the majority of HLA mate choice studies. While Hedrick and Black (1997) used a similarly evolutionarily relevant remote tribe of native South Americans, they only considered 2 class I loci (HLA-A and HLA-B) at the first typing field (serological allele group) and did not have data for genome-wide relatedness. The Himba are also unique among previously studied populations because they have both socially arranged and self-chosen partnerships, so the authors could test if freely-chosen partners had lower MHC-similarity than assigned or randomly chosen partners.

      Another key strength of the study was the relatively large sample size: HLA allele calls from 366 individuals (102 unrelated), and 219 individuals with HLA data, whole-genome SNP data, and involvement in a partnership.

      The study was also unique among HLA-mate choice studies for comparing peptide binding region protein divergence (calculated as the Grantham distance between amino acid sequences) among partner types and randomly generated pairs. This was also the first time I have seen a study use peptide binding prediction analysis of relevant human pathogens for potential offspring among partners to test if there would be a pathogen-relevant fitness benefit of partner selection.

      Weaknesses:

      My main concerns relate to the reliance on imputed HLA haplotypes and on IBD-based metrics in a region of the genome where both approaches are known to be problematic.

      First, several key results depend on HLA haplotypes inferred through imputation rather than directly observed sequence data. The authors trained HIBAG imputation models on Himba SNP data across the full 5 Mb HLA region using paired HLA allele calls from target capture sequencing (L251-253). However, the underlying SNP data were generated by mapping reads to a 1000 Genomes Yoruba reference, meaning that both SNP discovery and subsequent imputation depend on the haplotypes represented in that reference panel. As a result, the imputation framework is likely biased toward common haplotypes shared between the Himba and Yoruba populations, while rare or Himba-specific HLA alleles are less likely to be imputed accurately or at all. This limitation has been noted previously for HLA imputation, particularly for novel or low-frequency variants and for populations that are poorly represented in reference panels. While the authors compare (first-field) imputed alleles to sequenced alleles to assess imputation accuracy, this validation step itself may be biased toward the same common haplotypes that are easiest to impute. This becomes especially problematic if IBD is inferred using imputed haplotypes, because haplotype sharing would then primarily reflect common, reference-supported haplotypes, while true population-specific variation would be effectively invisible. In this scenario, downstream estimates of IBD sharing may be inflated for common haplotypes and deflated for rare ones, potentially biasing conclusions about haplotype sharing, selection, and mate choice at the HLA region.

      Second, the interpretation of excess identity-by-descent (IBD) sharing in the HLA region is difficult given the well-documented genomic properties of this locus. The classical HLA region is highly gene-dense, structurally complex, and characterized by extreme heterogeneity in recombination rates, with pronounced hot- and cold-spots (Miretti et al. 2005; de Bakker et al. 2006, reviewed in Radwan et al. 2020). Elevated IBD in such regions can arise from low recombination, background selection, or demographic processes such as bottlenecks, all of which can mimic signals of recent positive selection. While the authors suggest fluctuating or directional selection, extensive haplotype sharing is also consistent with long-term balancing selection at the MHC (Albrechtsen et al. 2010) or recent demographic history in this population.

      Beyond these main issues, there are several additional concerns that affect interpretation. Sample sizes and partnership counts are sometimes unclear; some figures would benefit from clearer scaling (Figure 1) and annotation (Figures S6 and S7), and key methodological choices (e.g., treatment of DRB copy number variation, no recombination correction in IBD calling) require further explanation. Finally, some conclusions, particularly those invoking optimality or specific selective mechanisms, are not directly tested by the analyses presented and would benefit from more cautious framing.

    3. Reviewer #2 (Public review):

      Summary:

      Evidence for the influence of MHC on mate choice in humans is challenging, as social structures and norms often confound the power of studying populations. This study uses an unusual, diverse, but relatively isolated population that allows a direct comparison of arranged and chosen partners to determine if MHC diversity is increased when choice drives mate choice. Overall, the authors use a range of genetic analyses to determine individual relationships alongside different measures of MHC diversity and potential selection pressures. The overall finding is that there is no difference in heterozygous dissimilarity between arranged and chosen partners. There is evidence of positive selection that may be a stronger driver, or at least it may mask other selection forces.

      Strengths:

      A rare opportunity to study human mate choice and genetic diversity. An excellent range of data and analysis that is well applied, and all results point to the same conclusion.

      Overall, this is a very well-written and concise paper when considering the significant amount of data and excellent analysis that has been undertaken.

      Weaknesses:

      (1) For the type of samples and data available, none are obvious.

      (2) Although this paper is clearly focused on humans, I was expecting more discussion around the studies that have been undertaken in animals. It is likely that between populations and species, there are different pressures that have driven the MHC evolution, but also mate choice.

      (3) The peptide presentation based on pathogen genomes is interesting but usually not significant. I wondered if another measure of MHC haplotype diversity to complement this would be the overall repertoire of peptides that could be presented, pathogen-based or otherwise. There is usually significant overlap in the peptides that can be presented, for example, between HLA-A and HLA-B, and this may reveal more significant differences between the alleles and haplotype frequencies.

    4. Reviewer #3 (Public review):

      The study investigates MHC-related mate choice in humans using a sample of couples from a small-scale sub-Saharan society. This is an important endeavour, as the vast majority of previous studies have been based on samples from complex, highly structured societies that are unlikely to reflect most of human evolutionary history. Moreover, the study controls for genome-wide diversity, allowing for a test of the specificity of the MHC region, as theoretically predicted. Finally, the authors examine potential fitness benefits by analysing predicted pathogen-binding affinities. Across all analyses, no deviations from random pairing are detected, suggesting a limited role for MHC-related mate choice in a relatively homogeneous society. Overall, I find the study to be carefully executed, and the paper clearly written. Nevertheless, I believe the paper would benefit if the following points were considered:

      (1) The authors claim (p. 2, l. 85) that their study is the first to employ a non-European small-scale society. I believe this claim is incorrect, as Hedrick and Black (1997) investigated MHC similarity among couples from South American indigenous populations.

      (2) Regarding the argument that in complex societies, mating with a random individual would already result in sufficient MHC dissimilarity (p. 2, l. 78), see the paper from Croy et al. 2020, which used the largest sample to date in this research area.

      (3) Dataset. As some relationships are parallel, I assume that certain individuals entered the dataset multiple times. This should be explicitly reported in the Methods. If I understand the analyses correctly, this non-independence was addressed by including individual identity as a random effect in the model - the authors should confirm whether this is the case. I am also wondering to what extent so-called "discovered partnerships" may affect the results. Shared offspring may be the outcome of short or transient affairs and could have a different social status compared with other informal relationships. Would the observed patterns change if these partnerships were excluded from the analyses?

      (4) How many pairs were due to relatedness closer than 3rd degree? In addition, why was 4th degree relatedness used as a threshold in some of the other analyses?

      (5) I was surprised by the exclusion of HIV, given that Namibia has a very high prevalence of HIV in the general population (e.g., Low et al. 2021).

      (6) It appears that age criteria were applied when generating random pairs (p. 8, l. 350). Could the authors please specify what they consider a realistic age gap, and on what basis this threshold was chosen? As these are virtual couples used solely to estimate random variation within the population, it is not entirely clear why age constraints are necessary. Would the observed patterns change if no age criteria were applied?

      (7) I think it would be helpful for readers if the Results section explicitly stated that real couples did not differ from randomly generated pairs. At present, only the comparison between chosen and arranged pairs is reported.

      (8) I appreciate the separate analyses of pathogen-binding properties for MHC class I and class II, given their functional distinctiveness. For the same reason, I would welcome a parallel analysis of MHC sharing conducted separately for class I and class II loci.

      (9) I think the Discussion would benefit from a more detailed comparison with previous studies. In addition, the manuscript does not explicitly address limitations of the current study, including the relatively limited sample size given the extensive polymorphism in the MHC region.

      References:

      Hedrick, P. W., & Black, F. L. (1997). HLA and mate selection: no evidence in South Amerindians. The American Journal of Human Genetics, 61(3), 505-511.

      Croy, I., Ritschel, G., Kreßner-Kiel, D., Schäfer, L., Hummel, T., Havlíček, J., ... & Schmidt, A. H. (2020). Marriage does not relate to major histocompatibility complex: A genetic analysis based on 3691 couples. Proceedings of the Royal Society B, 287(1936), 20201800.

      Low, A., Sachathep, K., Rutherford, G., Nitschke, A. M., Wolkon, A., Banda, K., ... & Mutenda, N. (2021). Migration in Namibia and its association with HIV acquisition and treatment outcomes. PLoS One, 16(9), e0256865.

    5. Author response:

      Reviewer 1 (Public review):

      Summary:

      This study aims to test whether human mate choice is influenced by HLA similarity while accounting for genome-wide relatedness, using the Himba as an evolutionarily relevant small-scale society population, unique among most HLA-mate choice studies. By comparing self-chosen ("love") and arranged marriages and using NGS-based 8-locus HLA class I and II sequences and genome-wide SNP data, the authors ask whether partners who freely choose each other are more HLA-dissimilar than those paired through social arrangements or random pairs. They further extend their work by examining functional differences in peptide-binding divergence among pairs and predicted pathogen recognition in potential offspring.

      Strengths:

      This study has many strengths. The most obvious is their ability to test for HLA-based mate choice in the Himba, a non-European, non-admixed, small-scale society population, the type of population that has been missing, in my opinion, from the majority of HLA mate choice studies. While Hedrick and Black (1997) used a similarly evolutionarily relevant remote tribe of native South Americans, they only considered 2 class I loci (HLA-A and HLA-B) at the first typing field (serological allele group) and did not have data for genome-wide relatedness. The Himba are also unique among previously studied populations because they have both socially arranged and self-chosen partnerships, so the authors could test if freely-chosen partners had lower MHC-similarity than assigned or randomly chosen partners.

      Another key strength of the study was the relatively large sample size: HLA allele calls from 366 individuals (102 unrelated), and 219 individuals with HLA data and whole-genome SNP data who were involved in a partnership.

      The study was also unique among HLA-mate choice studies for comparing peptide binding region protein divergence (calculated as the Grantham distance between amino acid sequences) among partner types and randomly generated pairs. This was also the first time I have seen a study use peptide binding prediction analysis of relevant human pathogens for potential offspring among partners to test if there would be a pathogen-relevant fitness benefit of partner selection.

      Weaknesses:

      My main concerns relate to the reliance on imputed HLA haplotypes and on IBD-based metrics in a region of the genome where both approaches are known to be problematic.

      First, several key results depend on HLA haplotypes inferred through imputation rather than directly observed sequence data. The authors trained HIBAG imputation models on Himba SNP data across the full 5 Mb HLA region using paired HLA allele calls from target capture sequencing (L251-253). However, the underlying SNP data were generated by mapping reads to a 1000 Genomes Yoruba reference, meaning that both SNP discovery and subsequent imputation depend on the haplotypes represented in that reference panel. As a result, the imputation framework is likely biased toward common haplotypes shared between the Himba and Yoruba populations, while rare or Himba-specific HLA alleles are less likely to be imputed accurately or at all. This limitation has been noted previously for HLA imputation, particularly for novel or low-frequency variants and for populations that are poorly represented in reference panels. While the authors compare (first-field) imputed alleles to sequenced alleles to assess imputation accuracy, this validation step itself may be biased toward the same common haplotypes that are easiest to impute. This becomes especially problematic if IBD is inferred using imputed haplotypes, because haplotype sharing would then primarily reflect common, reference-supported haplotypes, while true population-specific variation would be effectively invisible. In this scenario, downstream estimates of IBD sharing may be inflated for common haplotypes and deflated for rare ones, potentially biasing conclusions about haplotype sharing, selection, and mate choice at the HLA region.

      We appreciate the reviewer's concern, but would like to clarify two important misunderstandings in this assessment.

      First, the reviewer suggests that our SNP data were generated by mapping reads to a 1000 Genomes Yoruba reference, and that IBD inference may therefore be biased toward haplotypes common between the Himba and Yoruba. This is not the case. Our SNP genotype data were generated from the H3Africa and MEGAex genotyping arrays, which incorporated diverse reference variation to minimize ascertainment bias in non-European ancestries. No read mapping to a Yoruba reference genome was involved in SNP discovery or genotyping. The Yoruba 1000 Genomes data were used solely to provide an ancestry-matched recombination map for phasing and IBD calling; this would not bias IBD inference toward common Yoruba haplotypes. The reviewer's concern about imputation-driven inflation of IBD sharing for common haplotypes should not be relevant in our case.

      Second, regarding HLA haplotype resolution: we trained a bespoke HIBAG model directly on the Himba SNP array genotype data paired with ground-truth HLA allele calls from our own targeted HLA capture sequencing. This Himba-specific model was then used to impute HLA alleles from pseudo-homozygous genotypes derived by extracting phased SNP-based haplotypes across the HLA region for the same individuals. In this way we resolved the phase of the HLA allele calls. To our knowledge, this paired-data approach to individual-level HLA haplotype resolution is novel; existing HLA haplotype resolution tools generally provide only population-level haplotype frequency estimates rather than individual-level phase assignments. We are confident in the reliability of the haplotypes we report. Resolved haplotypes were required to match the known targeted-sequencing HLA allele calls at a minimum of the first field for at least one allele, and both haplotypes could not be assigned to the same allele unless the individual's HLA allele calls were homozygous. Of 722 total haplotypes, 698 were successfully resolved under these criteria. We report results only on these confidently resolved haplotypes.
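
      As an illustration only, the two acceptance criteria stated above can be sketched for a single locus as follows. This is one possible reading of the criteria, not the authors' code: the function names, the per-locus framing, and the choice to judge homozygosity at the first field are all assumptions made for the example.

```python
def first_field(allele):
    """Return locus plus first field, e.g. 'A*02:01' -> 'A*02'."""
    return allele.split(":")[0]


def passes_resolution_check(called, imputed):
    """Hypothetical per-locus check of a pair of resolved haplotype alleles.

    called  -- the two alleles from targeted sequencing for one individual
    imputed -- the two alleles imputed from that individual's phased haplotypes
    """
    called_ff = {first_field(a) for a in called}
    imputed_ff = [first_field(a) for a in imputed]

    # Criterion 1: at least one imputed allele matches a sequenced call
    # at the first field.
    matches_call = any(a in called_ff for a in imputed_ff)

    # Criterion 2: both haplotypes may carry the same allele only when the
    # sequenced calls are homozygous (assumed here: at first-field level).
    called_homozygous = len(called_ff) == 1
    same_allele_twice = imputed_ff[0] == imputed_ff[1]

    return matches_call and (called_homozygous or not same_allele_twice)
```

      Under this reading, a heterozygous individual whose two haplotypes are both assigned the same allele is rejected, while the same assignment is accepted for an individual whose sequenced calls are homozygous.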

      Second, the interpretation of excess identity-by-descent (IBD) sharing in the HLA region is difficult given the well-documented genomic properties of this locus. The classical HLA region is highly gene-dense, structurally complex, and characterized by extreme heterogeneity in recombination rates, with pronounced hot- and cold-spots (Miretti et al. 2005; de Bakker et al. 2006, reviewed in Radwan et al. 2020). Elevated IBD in such regions can arise from low recombination, background selection, or demographic processes such as bottlenecks, all of which can mimic signals of recent positive selection. While the authors suggest fluctuating or directional selection, extensive haplotype sharing is also consistent with long-term balancing selection at the MHC (Albrechtsen et al. 2010) or recent demographic history in this population.

      We thank the reviewer for highlighting the difficulty in modeling selection at the HLA, a problem that deserves considerable attention. We acknowledge that demographic processes such as the documented Himba population bottleneck can result in elevated IBD sharing (Swinford et al. 2023, PNAS). However, our comparison of HLA IBD sharing rates against a genome-wide baseline is designed to address this: demographic processes affect all regions of the genome, so if the HLA region maintains elevated IBD sharing significantly above the genome-wide threshold, this provides meaningful evidence for a locus-specific effect beyond demographic history alone.

      We agree with the reviewer that the recombination landscape of the HLA region is complex, but this complexity itself is consistent with the region being a frequent target of selection. Previous HLA analyses have found that at the allele level, frequencies are consistent with balancing selection, while multi-locus haplotype frequencies are consistent with purifying selection and positive frequency-dependent selection (Alter et al., 2017), patterns that contribute to the complex recombination rate heterogeneity observed in the region. Recombination rate can be both a cause of extended haplotypes and a consequence of selection against combinations of alleles.

      As Alter et al. note, the high levels of linkage disequilibrium observed among HLA alleles serve to limit the amount of diversity within HLA haplotypes, but balancing selection at the allelic level maintains multiple HLA haplotypes at high frequency across populations over long periods of time — so-called "conserved extended haplotypes" as we observe (Supplementary Figures 1 and 9). Regarding the specific selective mechanism, our results are not equally consistent with all forms of balancing selection. Albrechtsen et al. (2010) explicitly modeled overdominant balancing selection and demonstrated that equilibrium overdominance does not produce elevated IBD sharing as we observe — our results are therefore inconsistent with this mechanism. Instead, Albrechtsen et al. conclude that allele frequency change is required to generate elevated IBD, consistent with bouts of directional selection such as negative frequency-dependent or fluctuating positive selection. We will make explicit that while our findings do not support overdominance, they are consistent with these temporally dynamic forms of selection driving periodic allele frequency change at the HLA locus. We will also incorporate local recombination rate into Figure 4 to provide a comparison of local recombination rate across chromosome 6 with the observed areas of elevated IBD sharing.

      Alter, I., Gragert, L., Fingerson, S., Maiers, M., & Louzoun, Y. (2017). HLA class I haplotype diversity is consistent with selection for frequent existing haplotypes. PLoS computational biology, 13(8), e1005693.

      Beyond these main issues, there are several additional concerns that affect interpretation. Sample sizes and partnership counts are sometimes unclear; some figures would benefit from clearer scaling (Figure 1) and annotation (Figures S6 and S7), and key methodological choices (e.g., treatment of DRB copy number variation, no recombination correction in IBD calling) require further explanation. Finally, some conclusions, particularly those invoking optimality or specific selective mechanisms, are not directly tested by the analyses presented and would benefit from more cautious framing.

      We will clarify the presentation of partnership counts and sample sizes throughout the manuscript and improve the scaling and annotation of the flagged figures. Regarding DRB copy number variation, we will add explicit discussion of our analytical choices and their potential limitations. As described in our responses to the main concerns above, we will also provide more nuanced framing of the selective mechanisms consistent with our IBD results, avoiding conclusions that go beyond what our analyses directly support.

      Reviewer #2 (Public review):

      Summary:

      Evidence for the influence of MHC on mate choice in humans is challenging, as social structures and norms often confound the power of studying populations. This study uses an unusual, diverse, but relatively isolated population that allows a direct comparison of arranged and chosen partners to determine if MHC diversity is increased when choice drives mate choice. Overall, the authors use a range of genetic analyses to determine individual relationships alongside different measures of MHC diversity and potential selection pressures. The overall finding is that there is no difference in heterozygous dissimilarity between arranged and chosen partners. There is evidence of positive selection that may be a stronger driver, or at least it may mask other selection forces.

      Strengths:

      A rare opportunity to study human mate choice and genetic diversity. An excellent range of data and analysis that is well applied, and all results point to the same conclusion.

      Overall, this is a very well-written and concise paper when considering the significant amount of data and excellent analysis that has been undertaken.

      Weaknesses:

      (1) For the type of samples and data available, none are obvious.

      (2) Although this paper is clearly focused on humans, I was expecting more discussion around the studies that have been undertaken in animals. It is likely that between populations and species, there are different pressures that have driven the MHC evolution, but also mate choice.

      We will improve the framing of our project within the broader non-human MHC mate choice literature in our discussion.

      (3) The peptide presentation based on pathogen genomes is interesting but usually not significant. I wondered if another measure of MHC haplotype diversity to complement this would be the overall repertoire of peptides that could be presented, pathogen-based or otherwise. There is usually significant overlap in the peptides that can be presented, for example, between HLA-A and HLA-B, and this may reveal more significant differences between the alleles and haplotype frequencies.

      We would like to clarify that we did assess the unique pathogen peptides bound across all HLA class I and class II genes by each population's common haplotypes (Figures S12–S13). We acknowledge the reviewer's point that non-pathogenic peptides are also important — for example, binding with self-produced proteins. However, binding with self-produced proteins is more relevant to autoimmune risk, and the selective pressures involved are outside the scope of our current work, which focuses on pathogen-induced fluctuating directional selection and heterozygote advantage. Furthermore, selection on non-pathogenic peptide binding repertoires likely operates in the opposite direction to pathogen repertoire; whereas broader pathogen peptide binding is advantageous, broader self-peptide binding risks excessive immune activation.

      Reviewer #3 (Public review):

      The study investigates MHC-related mate choice in humans using a sample of couples from a small-scale sub-Saharan society. This is an important endeavour, as the vast majority of previous studies have been based on samples from complex, highly structured societies that are unlikely to reflect most of human evolutionary history. Moreover, the study controls for genome-wide diversity, allowing for a test of the specificity of the MHC region, as theoretically predicted. Finally, the authors examine potential fitness benefits by analysing predicted pathogen-binding affinities. Across all analyses, no deviations from random pairing are detected, suggesting a limited role for MHC-related mate choice in a relatively homogeneous society. Overall, I find the study to be carefully executed, and the paper clearly written. Nevertheless, I believe the paper would benefit if the following points were considered:

      (1) The authors claim (p. 2, l. 85) that their study is the first to employ a non-European small-scale society. I believe this claim is incorrect, as Hedrick and Black (1997) investigated MHC similarity among couples from South American indigenous populations.

      We thank the reviewer for this important clarification. Our claim was intended to be more specific: to our knowledge, this is the first study to investigate HLA-based mate preferences in a non-European small-scale society while explicitly controlling for genome-wide relatedness. Hedrick and Black (1997) did not include genome-wide relatedness controls, which is a critical distinction given that ancestry-assortative mating can produce spurious patterns of HLA similarity or dissimilarity in the absence of such correction. We will make this qualification explicit in the revised manuscript.

      (2) Regarding the argument that in complex societies, mating with a random individual would already result in sufficient MHC dissimilarity (p. 2, l. 78), see the paper from Croy et al. 2020, which used the largest sample to date in this research area.

      We thank the reviewer for this reference. In our revision, we will incorporate Croy et al. (2020) into our discussion and use it as a reference for comparing the Himba’s probability of highly homozygous offspring given population allele frequencies. This comparison will help support our claim that background HLA diversity in the Himba is sufficiently high so that any unrelated partner is already likely to yield adequately dissimilar offspring—a scenario that would reduce the selective benefit of active HLA-based mate choice and could mask any such preference even if it exists.

      (3) Dataset. As some relationships are parallel, I assume that certain individuals entered the dataset multiple times. This should be explicitly reported in the Methods. If I understand the analyses correctly, this non-independence was addressed by including individual identity as a random effect in the model - the authors should confirm whether this is the case. I am also wondering to what extent so-called "discovered partnerships" may affect the results. Shared offspring may be the outcome of short or transient affairs and could have a different social status compared with other informal relationships. Would the observed patterns change if these partnerships were excluded from the analyses?

      The reviewer is correct that individuals appear multiple times in the dataset: some individuals are members of multiple known partnerships, and all individuals are additionally included many times across the full set of possible random heterosexual pairings that meet our age and relatedness criteria. This non-independence is explicitly addressed in our dyadic linear mixed models by including female ID and male ID as random effects, which account for each individual's unique contribution to their similarity scores across all pairings, both real and random. We explain this explicitly in the Statistical Models section (n) of the Methods.

      Regarding discovered partnerships: we grouped these with reported informal partnerships in the current analyses due to modest sample sizes. We agree this is worth examining more carefully and will test, in our revision, whether treating discovered partnerships as a separate category, or excluding them entirely, meaningfully affects our results. We will report these analyses as a sensitivity check.

      (4) How many pairs were due to relatedness closer than 3rd degree? In addition, why was 4th degree relatedness used as a threshold in some of the other analyses?

      This information is reported in the Statistical Models section (n) of the Methods. No pairs were found to be closer than 3rd degree relatives. No arranged marriages were related at 3rd degree or closer; 1 love match marriage and 2 informal partnerships discovered through pedigree analysis were found to be 3rd degree relatives.

      Regarding the difference in relatedness thresholds: we used a 4th degree cutoff to define the unrelated set of individuals for allele and haplotype frequency analyses (n=102), as even 3rd degree relatives would inflate allele frequency estimates. In contrast, we permitted 3rd degree relatives in the background distribution for the partnership analyses to reflect the stated cultural preference for cousin marriages in arranged unions—excluding them would have made the background distribution less representative of the actual mating pool. We explain both decisions in Methods sections (d) and (n).

      (5) I was surprised by the exclusion of HIV, given that Namibia has a very high prevalence of HIV in the general population (e.g., Low et al. 2021).

      While HIV prevalence is indeed high in Namibia generally, the Himba are a relatively isolated population and, based on personal communication with Dr. Ashley Hazel—who has extensive field experience studying sexually transmitted infections in the Himba (see references 36, 52, 53, and 54)—there is no evidence of HIV transmission within this population. Dr. Hazel's expertise on this question was the basis for our exclusion of HIV from the pathogen list.

      (6) It appears that age criteria were applied when generating random pairs (p. 8, l. 350). Could the authors please specify what they consider a realistic age gap, and on what basis this threshold was chosen? As these are virtual couples used solely to estimate random variation within the population, it is not entirely clear why age constraints are necessary. Would the observed patterns change if no age criteria were applied?

      We will clarify this in our revision, but we restricted random couples to have an age gap within the range observed in actual, known partnerships (the woman is at most 16 years older than the man and at most 53 years younger than the man). We included this criterion to ensure that random couples represented the best approximation of realistic background partners. Our age gap criteria were quite permissive due to the large range observed in our actual pairs, and we do not expect them to have significantly affected our results.
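
      For concreteness, the age-gap window described above can be sketched as a filter over all possible opposite-sex pairings. This is a minimal illustration, not the authors' pipeline: the function names and the dictionary inputs are hypothetical, only the two bounds (16 and 53 years) come from the response, and the relatedness criterion applied in the study is deliberately left out.

```python
from itertools import product

# Bounds taken from the response: the woman is at most 16 years older
# and at most 53 years younger than the man.
MAX_WOMAN_OLDER = 16
MAX_WOMAN_YOUNGER = 53


def age_gap_ok(woman_age, man_age):
    """True if the pair falls inside the observed age-gap range."""
    gap = woman_age - man_age
    return -MAX_WOMAN_YOUNGER <= gap <= MAX_WOMAN_OLDER


def background_pairs(women, men):
    """Enumerate opposite-sex pairs that satisfy the age-gap window.

    women, men -- hypothetical dicts mapping individual ID -> age.
    The relatedness filter used in the study is not modelled here.
    """
    return [
        (w_id, m_id)
        for (w_id, w_age), (m_id, m_age) in product(women.items(), men.items())
        if age_gap_ok(w_age, m_age)
    ]
```

      Because the window spans 69 years in total, only pairs with extreme age differences are dropped, which is in line with the statement that the criterion was permissive.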

      (7) I think it would be helpful for readers if the Results section explicitly stated that real couples did not differ from randomly generated pairs. At present, only the comparison between chosen and arranged pairs is reported.

      We would like to clarify that for each analysis we explicitly report both the effects of chosen and arranged partnerships relative to the background distribution intercept, and the pairwise contrast between chosen and arranged partnerships. The intercept of each model is derived from the full background distribution of random opposite-sex pairings meeting our age and relatedness criteria, providing a null expectation under random mating. A non-significant effect for both partnership types therefore indicates that neither arranged nor chosen partnerships differ from random mating with respect to the metric in question. We describe this explicitly in the Statistical Models section of the Methods, but we will ensure this interpretation is stated more prominently in the Results section of the revised manuscript to avoid any confusion.

      (8) I appreciate the separate analyses of pathogen-binding properties for MHC class I and class II, given their functional distinctiveness. For the same reason, I would welcome a parallel analysis of MHC sharing conducted separately for class I and class II loci.

      We can incorporate separate HLA similarity/log odds of homozygous offspring analyses for class I and class II in our revision.

      (9) I think the Discussion would benefit from a more detailed comparison with previous studies. In addition, the manuscript does not explicitly address limitations of the current study, including the relatively limited sample size given the extensive polymorphism in the MHC region.

      We will expand our discussion in the revision to provide a more detailed comparison with previous studies, including Croy et al. (2020), and will add an explicit limitations section incorporating suggestions from multiple reviewers on more careful framing of optimality and specific selective mechanisms. Regarding sample size, we acknowledge this as a genuine limitation given the extensive polymorphism of the MHC region. However, the unrelated sample size used for our allelic diversity estimates is comparable to previous studies in African populations (Figure 1), and our dataset is uniquely comprehensive in combining HLA class I, class II, genome-wide SNP data, and partnership data within the same individuals, a combination that enables the genome-wide relatedness correction that distinguishes our study from much of the prior literature.

      References

      Hedrick, P. W., & Black, F. L. (1997). HLA and mate selection: no evidence in South Amerindians. The American Journal of Human Genetics, 61(3), 505-511.

      Croy, I., Ritschel, G., Kreßner-Kiel, D., Schäfer, L., Hummel, T., Havlíček, J., ... & Schmidt, A. H. (2020). Marriage does not relate to major histocompatibility complex: A genetic analysis based on 3691 couples. Proceedings of the Royal Society B, 287(1936), 20201800.

      Low, A., Sachathep, K., Rutherford, G., Nitschke, A. M., Wolkon, A., Banda, K., ... & Mutenda, N. (2021). Migration in Namibia and its association with HIV acquisition and treatment outcomes. PLoS One, 16(9), e0256865.

    1. eLife Assessment

      This study presents a valuable finding on the condition dependence of autophagy-mediated lifespan regulation in C. elegans. The evidence is solid, as the data broadly support the main claims, although variability between biological replicates and limited mechanistic exploration leave some conclusions less firmly established. The work will be of interest to researchers studying autophagy, ageing, and intracellular trafficking.

    2. Reviewer #1 (Public review):

      Summary:

      Hsiung et al. investigated whether the effects of autophagy gene knockdown on the lifespan of long-lived C. elegans mutants depend on experimental conditions. The authors first compiled published data on autophagy-dependent lifespan regulation in daf-2 and wild-type backgrounds, highlighting that prior results are notably inconsistent and likely context-dependent. They then systematically tested the lifespan effects of RNAi knockdown of six autophagy genes (atg-2, atg-4.1, atg-9, atg-13, atg-18, and bec-1) in wild-type (N2), daf-2 (reduced insulin/IGF-1 signalling), and glp-1 (germlineless) animals, while varying temperature, daf-2 allele, FUDR concentration, and bacterial infection status.

      The key findings are as follows. In wild-type animals, lifespan suppression by most autophagy gene knockdowns was more pronounced at 20°C than at 25°C, where little or no effect was observed. In daf-2 mutants, stronger lifespan suppression was seen in the weaker daf-2(e1368) allele at 20°C, but not in the stronger daf-2(e1370) allele, and effects were largely absent at 25°C. In glp-1 mutants, four of six gene knockdowns suppressed lifespan to a greater extent than in N2, though again in a temperature-dependent manner. FUDR at a high concentration (800 µM) abolished the life-shortening effects of most knockdowns and, in the case of atg-9 and atg-13, led to lifespan extension. Kanamycin treatment to eliminate bacterial proliferation did not fully account for the lifespan effects, suggesting that increased susceptibility to infection is not the primary mechanism. The authors also tested the programmed aging hypothesis that autophagy promotes lifespan reduction through biomass repurposing, but found no changes in vitellogenin levels upon knockdown of any of the six genes.

      Altogether, among all genes tested, atg-18 knockdown produced the strongest and most consistent lifespan suppression across nearly all conditions, including both daf-2 and glp-1 backgrounds. The authors probed whether atg-18 acts through the FOXO transcription factor DAF-16 by examining dauer formation and ftn-1 expression, but found no evidence for this, suggesting a DAF-16-independent mechanism.

      Strengths:

      The primary strength of this work lies in its systematic and comprehensive approach to dissecting how experimental variables influence the outcome of autophagy-lifespan epistasis tests. The compilation of prior data alongside the authors' own multi-condition dataset is a genuinely useful resource for the field. The study raises a timely and important point about condition selection bias, which is relevant not only to autophagy research but to C. elegans aging studies more broadly. The finding that atg-18 behaves distinctly from other autophagy genes across all conditions is noteworthy and opens avenues for future mechanistic work.

      Weaknesses:

      Despite its breadth, the study has several weaknesses that limit the strength of some conclusions.

      (1) Variability in control lifespan data. The N2 lifespan values under ostensibly identical conditions (e.g., GFP RNAi at 20°C) differ substantially across experiments (compare Tables S2, S5, S6, S7, and S9). Since N2 serves as the baseline for calculating whether the effect is greater in long-lived mutants via Cox proportional hazard (CPH) analysis, this variability in controls directly affects the reliability of those comparisons.

      (2) Limited biological replication. Most experiments were performed with only two biological replicates. In several cases, the two replicates yield contradictory outcomes: one showing significant lifespan suppression and the other showing no effect or even extension. The authors combine these into cumulative datasets for analysis, which, while not incorrect in principle, may obscure genuine irreproducibility. Given that the central message of the paper concerns variability and condition dependence, additional replication would have substantially strengthened confidence in the reported results.

      (3) Low sample sizes in individual trials. A number of lifespan assays were conducted with only 40-50 worms per replicate, and in some cases, as few as 30. Such sample sizes are below the standard commonly used in the C. elegans aging field and are likely to contribute to the variability observed.

      (4) RNAi efficacy measured only in N2 at 20°C. The authors demonstrated that atg-2 and atg-4.1 RNAi did not significantly reduce target mRNA levels, which may explain their weaker lifespan effects. However, these same RNAi treatments significantly affected lifespan in several other conditions (e.g., daf-2(e1368) at 20°C, glp-1 at 20°C and 25°C, and N2 with 15 µM FUDR). Measuring RNAi efficacy across different genetic backgrounds and conditions would be needed to properly interpret these variable results.

      (5) Incomplete mechanistic exploration. The investigation of why atg-18 knockdown has uniquely strong effects was limited to DAF-16. Given published evidence that atg-18 may regulate HLH-30/TFEB, a master transcriptional regulator of autophagy and lysosomal biogenesis, testing whether atg-18 specifically affects HLH-30 nuclear localisation or activity could have provided valuable mechanistic insight and would distinguish atg-18 from the other genes tested.

    3. Reviewer #2 (Public review):

      Summary:

      This study examines how genes involved in cellular recycling (autophagy) influence lifespan under different experimental conditions. The findings help clarify why previous studies have reported conflicting results about whether blocking autophagy shortens or extends lifespan. The work will be of interest to researchers studying aging and cellular stress responses, particularly those using model organisms.

      Strengths:

      The findings are valuable, as they help resolve inconsistencies within a specific subfield of aging research. The evidence presented is solid, as the data broadly support the primary claims of the study. In addition, the discussion is thorough and thoughtfully integrates the findings within the broader context of the field.

      Weaknesses:

      Additional functional validation would further strengthen the conclusions.

    1. eLife Assessment

      This study establishes a methodology (machine vision and gaze pose estimation) and behavioral apparatus for examining social interactions between pairs of marmoset monkeys. It has been difficult to study social interactions using artificial stimuli rather than genuine interactions between unrestrained animals. This study makes a fundamental contribution to social neuroscience research in a laboratory setting. The results are convincing, showing that the study of unrestrained social interactions is possible with detailed quantification of position and gaze. The methodology presented here is relevant to research in social neuroscience, neuroethology, and primatology.

    2. Reviewer #1 (Public review):

      Summary:

      The current study by Xing et al. establishes the methodology (machine vision and gaze pose estimation) and behavioral apparatus for examining social interactions between pairs of marmoset monkeys. Their results enable unrestrained social interactions under more rigorous conditions with detailed quantification of position and gaze. It has been difficult to study social interactions using artificial stimuli, as opposed to genuine interactions between unrestrained animals. This study makes an important contribution for studying social neuroscience within a laboratory setting that will be valuable to the field.

      Strengths:

      Marmosets are an ideal species for studying primate social interactions due to their prosocial behavior and the ease of group housing within laboratory environments. They also predominantly orient their gaze through head movements during social monitoring. Recent advances in machine vision pose estimation set the stage for estimating 3D gaze position in marmosets but require additional innovation beyond DeepLabCut or equivalent methods. A six-point facial frame is designed to accurately fit marmoset head gaze. A key assumption in the study is that head gaze is a reliable indicator of the marmoset's gaze direction, which will also depend on the eye position. Overall, this assumption has been well supported by recent studies in head-free marmosets. Thus the current work introduces an important methodology for leveraging machine vision to track head gaze and demonstrates its utility for use with interacting marmoset dyads as a first step in that study.

      Comments on revisions:

      I thank the authors for their careful revisions of the manuscript. It has addressed all of my comments.

      One final suggestion would be to add a scale bar to Supplemental Figure 2A so that the size of the video/image stimuli on the monitor is clear (in cm), and to report the range of distances (in cm) from which the marmosets viewed these stimuli. This will enable calculation of the rough accuracy in degrees of visual angle.
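
      The conversion implied here can be sketched as follows. This is a minimal illustration that assumes a flat stimulus viewed head-on; the function name and the example size and distances are hypothetical and are not taken from the manuscript.

```python
import math


def visual_angle_deg(size_cm: float, distance_cm: float) -> float:
    """Visual angle (degrees) subtended by a flat stimulus of extent size_cm
    viewed head-on from distance_cm."""
    if size_cm < 0 or distance_cm <= 0:
        raise ValueError("size must be non-negative and distance positive")
    return math.degrees(2.0 * math.atan(size_cm / (2.0 * distance_cm)))


# Hypothetical values (NOT from the manuscript): a 10 cm wide stimulus
# viewed from two assumed distances.
for distance in (30.0, 60.0):
    print(f"{distance:.0f} cm -> {visual_angle_deg(10.0, distance):.1f} deg")
```

      Reporting the monitor size together with the near and far viewing distances would bound the visual angle of each stimulus, and hence the angular scale against which the spread of gaze points can be judged.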

    3. Reviewer #2 (Public review):

      Summary:

      The manuscript describes novel technique development and experiments to track the social gaze of marmosets. The authors used video tracking of multiple cameras in pairs of marmosets to infer head orientation and gaze, and then studied gaze direction as a function of distance between animals, relationships, and social conditions/stimuli.

      Strengths:

      Overall the work is interesting and well done. It addresses an area of growing interest in animal social behavior, an area that has largely been dominated by research in rodents and other non-primate species. In particular, this work addresses something that is uniquely primate (perhaps not unique, but not studied much in other laboratory model organisms), which is that primates, like humans, look at each other, and this gaze is an important social cue of their interactions. As such, the presented work is an important advance and addition to the literature that will allow more sophisticated quantification of animal behaviors. I am particularly enthusiastic about how the authors approach the cone of uncertainty in gaze, which can be both due to some error in head orientation measurements as well as variable eye position.

      Weaknesses:

      While there remains some degree of uncertainty in the precise accuracy of the gaze measure, the authors have done an excellent job accounting for these as well as they can, and appropriately acknowledge the limitations of their approach.

      Comments on revisions:

      I have no further recommendations. The authors addressed my previous suggestions or acknowledged them as topics for future investigation. This is excellent work.

    4. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      The current study by Xing et al. establishes the methodology (machine vision and gaze pose estimation) and behavioral apparatus for examining social interactions between pairs of marmoset monkeys. Their results enable unrestrained social interactions under more rigorous conditions with detailed quantification of position and gaze. It has been difficult to study social interactions using artificial stimuli, as opposed to genuine interactions between unrestrained animals. This study makes an important contribution for studying social neuroscience within a laboratory setting that will be valuable to the field.

      Strengths:

      Marmosets are an ideal species for studying primate social interactions due to their prosocial behavior and the ease of group housing within laboratory environments. They also predominantly orient their gaze through head movements during social monitoring. Recent advances in machine vision pose estimation set the stage for estimating 3D gaze position in marmosets but require additional innovation beyond DeepLabCut or equivalent methods. A six-point facial frame is designed to accurately fit marmoset head gaze. A key assumption in the study is that head gaze is a reliable indicator of the marmoset's gaze direction, which will also depend on the eye position. Overall, this assumption has been well supported by recent studies in head-free marmosets. Thus the current work introduces an important methodology for leveraging machine vision to track head gaze and demonstrates its utility for use with interacting marmoset dyads as a first step in that study.

      Weaknesses:

      One weakness that should be easily addressed is that no data is provided to directly assess how accurate the estimated head gaze is based on calibrations of the animals, for example, when they are looking at discrete locations like faces or video on a monitor. This would be useful to get an upper bound on how accurate the 3D gaze vector is estimated to be, for planned use in other studies. Although the accuracy appears sufficient for the current results, it would be difficult to know if it could be applied in other contexts where more precision might be necessary.

      Please see our detailed responses to the reviewer comments below.

      Reviewer #2 (Public review):

      Summary:

      The manuscript describes novel technique development and experiments to track the social gaze of marmosets. The authors used video tracking of multiple cameras in pairs of marmosets to infer head orientation and gaze and then studied gaze direction as a function of distance between animals, relationships, and social conditions/stimuli.

      Strengths:

      Overall the work is interesting and well done. It addresses an area of growing interest in animal social behavior, an area that has largely been dominated by research in rodents and other non-primate species. In particular, this work addresses something that is uniquely primate (perhaps not unique, but not studied much in other laboratory model organisms), which is that primates, like humans, look at each other, and this gaze is an important social cue of their interactions. As such, the presented work is an important advance and addition to the literature that will allow more sophisticated quantification of animal behaviors. I am particularly enthusiastic about how the authors approach the cone of uncertainty in gaze, which can be both due to some error in head orientation measurements as well as variable eye position.

      Weaknesses:

      There are a few technical points in need of clarification, both in terms of the robustness of the gaze estimate, and possible confounds by gaze to non-face targets which may have relevance but are not discussed. These are relatively minor, and more suggestions than anything else.

      Please see our detailed responses to the reviewer comments below.

      Reviewer #1 (Recommendations for the authors):

      Major comments:

      (1) It appears that the accuracy of the estimated gaze angle must be well under the size of the gaze cone (+/- 10 degrees), but I can't find any direct estimate of the accuracy even if it is just a ballpark figure. On Lines 219-233 is where performance is described for viewing images and video on a monitor, where it should be possible to reconstruct the point of gaze on the monitor while images and video are shown, in order to evaluate the accuracy of the system for where the marmoset is looking? Would you see eye position traces that would show fixation clusters around those images or videos with stationary points on the monitor much like that seen for head-fixed animals looking at faces on a screen (Mitchell et al, 2014)? If so, what is the typical spread of those clusters during fixations on an image, both in terms of the precision by RMS error during a fixation epoch and the spread around the images at different locations (accuracy of projection)? For example, if gaze clusters were always above the displayed images one would have an idea that the face plane is slightly offset above the true gaze direction. It is not completely clear how well the face plane and corresponding gaze cone do in describing gaze direction in space, but the monitor stimuli could be used as an initial validation of it.

      We thank the reviewer for this important suggestion regarding the quantitative validation of gaze accuracy. We agree that, when animals view stimuli presented on a monitor, the estimated gaze direction can be evaluated by examining the spatial distribution of gaze–monitor intersection points relative to stimulus locations.

      To address this, we generated a new figure (Fig. S2A) analyzing gaze behavior following the onset of video stimuli presented at different locations on the monitor. Specifically, we selected video clips in which human annotators verified that the marmosets were looking at the monitor. Consistent with prior work in head-fixed marmosets (Mitchell et al., 2014), we observe clustering of gaze–monitor intersection centers within and around the corresponding stimulus locations after stimulus onset. These clusters provide an empirical validation that the estimated gaze direction aligns with stimulus position in space.

      Importantly, unlike the head-fixed preparation used in Mitchell et al. (2014), marmosets in our study were freely moving. As a result, they do not exhibit prolonged, stationary fixations on the monitor, and fixation clusters are therefore more diffuse. This increased spread reflects natural head and body motion rather than limitations of the gaze estimation method itself. Despite this, gaze intersection points remain spatially localized to the vicinity of the presented stimuli across different monitor locations.

      We did observe small offsets in some gaze clusters relative to stimulus centers; however, these offsets were not systematic across stimulus locations or animals. Crucially, there was no consistent bias (e.g., clusters appearing uniformly above or below stimuli) that would indicate a systematic misalignment of the face plane or gaze cone relative to true gaze direction. Together, these observations support the conclusion that the face-plane-based gaze cone provides an accurate estimate of gaze direction in space, with precision well within the ±10° aperture of the gaze cone.

      While the freely moving component of the behavior precludes direct estimation of fixation RMS error comparable to head-fixed paradigms, the observed stimulus-locked clustering serves as an initial validation of both the accuracy and practical utility of our approach under naturalistic conditions.
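
      The gaze-monitor intersection analysis described above can be sketched as a ray-plane intersection followed by an angular comparison against the ±10° cone. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the coordinate frame and the example geometry are all hypothetical.

```python
import math
from typing import Optional, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def sub(a: Sequence[float], b: Sequence[float]) -> Vec3:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def ray_plane_intersection(origin: Vec3, direction: Vec3,
                           plane_point: Vec3, plane_normal: Vec3) -> Optional[Vec3]:
    """Point where the gaze ray meets the monitor plane, or None if the ray
    is parallel to the plane or the plane lies behind the animal."""
    denom = dot(direction, plane_normal)
    if abs(denom) < 1e-12:
        return None  # gaze runs parallel to the monitor
    t = dot(sub(plane_point, origin), plane_normal) / denom
    if t <= 0:
        return None  # monitor is behind the gaze origin
    return (origin[0] + t * direction[0],
            origin[1] + t * direction[1],
            origin[2] + t * direction[2])


def angular_offset_deg(origin: Vec3, direction: Vec3, target: Vec3) -> float:
    """Angle (degrees) between the gaze direction and the line to a target."""
    to_target = sub(target, origin)
    norms = math.sqrt(dot(direction, direction)) * math.sqrt(dot(to_target, to_target))
    if norms == 0:
        raise ValueError("zero-length gaze direction or target at the origin")
    cos_angle = max(-1.0, min(1.0, dot(direction, to_target) / norms))
    return math.degrees(math.acos(cos_angle))


# Hypothetical geometry (NOT from the manuscript): head at the origin looking
# along +z, monitor plane at z = 30, stimulus centre 3 units off-axis.
head, gaze = (0.0, 0.0, 0.0), (0.0, 0.0, 1.0)
hit = ray_plane_intersection(head, gaze, (0.0, 0.0, 30.0), (0.0, 0.0, -1.0))
offset = angular_offset_deg(head, gaze, (3.0, 0.0, 30.0))
print(hit, f"{offset:.2f} deg", "inside cone" if offset <= 10.0 else "outside cone")
```

      Collecting the intersection points over stimulus-locked frames and summarising their offsets per stimulus location would give the kind of accuracy estimate the reviewer requests.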

      (2) A second major comment is about clarity in the writing of the results and discussion. At the end of the manuscript, a major takeaway is the difference between familiar and unfamiliar dyads, that males show more interest in viewing females including unfamiliar females, but for familiar females, this distinction is also associated with being likely to look at them if they look at the male, and then to engage in joint gaze with them after looking at them, which indicates more of a social interaction than simply monitoring them when they are unfamiliar. Those aspects of the results could be emphasized more in the topic sentences of paragraphs presenting data to support those features of the gaze data (at present is buried at the ends of results paragraphs and back in the discussion).

      We thank the reviewer for this insightful suggestion. We have restructured the Results and Discussion sections to lead with the primary social takeaways rather than technical descriptions (Tracked changes in Word). Specifically, we now emphasize the distinction between "social monitoring" (characteristic of unfamiliar dyads) and "active social coordination" (characteristic of familiar dyads).

      (1) Topic Sentences: We revised the topic sentences of all Results paragraphs to immediately highlight the findings regarding male interest and the influence of familiarity on reciprocation.

      (2) Conceptual Framework: We added a conceptual distinction in the Discussion, explaining that while unfamiliar marmosets maintain high social attention through "peripheral monitoring" and proximity-dependent joint gaze, familiar pairs exhibit sophisticated, distance-independent coordination and gaze reciprocation.

      (3) Clarification of Male Interest: We explicitly stated that while male interest in females is high regardless of familiarity, it manifests as persistent monitoring in unfamiliar pairs versus a more aware, reciprocal state in familiar pairs.

      Minor comments:

      (1) Methods:

      a) Lines 522-539: The 200 continuous frames used for validation of the model containing two marmosets are sufficient to test how well it generalizes to other animals outside the training set? The RMSE reported, does it vary for animals inside vs outside the training set? To what extent does the RMSE, in image pixels, translate into accuracy in estimating the gaze direction, for example, as assessed by estimating error when marmosets look at images or video on the monitor?

      To address the reviewer’s concern regarding generalization and the translation of pixel RMSE to angular accuracy, we emphasize that the six facial features selected are prominent, high-contrast features across the species. Consequently, we observed that the RMSE remained consistent for marmosets both inside and outside the training set. To quantify how pixel-level tracking error translates into gaze estimation accuracy, we performed a sensitivity analysis. We simulated landmark (i.e., feature) jitter by sampling perturbations from circular distributions based on our empirical data (2.4 pixels for eyes; 2.1 pixels for the central blaze). Our results, illustrated in Author response image 1, show that 90% of the resulting head gaze deviations fall within 10°, which is consistent with the angular threshold used for our gaze cone model. This confirms that the reported RMSE provides sufficient precision for reliable gaze estimation.

      Author response image 1.

      Probability distribution of gaze angular deviation under circular perturbation. The histogram (blue) represents the change in reconstructed gaze angle (degrees) following stochastic perturbation of facial features. To simulate real-world variance, noise was sampled from circular distributions with radii of 2.4 pixels (eyes) and 2.1 pixels (central blaze). The red curve represents an exponential fit to the empirical data (y = a·e^(bx), a = 0.9591, b = 0.1813). Approximately 90% of the reconstructed gaze deviations remain below 10°, indicating the model’s localised stability under pixel-level coordinate jitter.

      b) Line 542-43: Is there any difference between a rigid model fit to the six facial points, versus using the plane defined by the two eyes and central blaze in terms of direction accuracy (in the ground truth validation)? How does the "semi-rigid" set of six points (mentioned also in lines 201-203) constrain the fit of the three points (two eyes and central blaze) that define the normal plane for the gaze cone?

      We thank the reviewer for the opportunity to clarify our geometric model. The plane used to define the gaze cone's origin was indeed determined by the two eyes and the central blaze. However, a plane defined by only three points was insufficient to determine a unique gaze direction, as the normal vector was ambiguous (it could point forward through the face or backward through the head).

      To resolve this, we utilized the relative positions of the two ear tufts. Because the tufts are anatomically situated behind the eyes and blaze, these additional points provide the necessary spatial context to orient the gaze vector correctly. In our validation, we found that including the mouth landmark does not alter the angular accuracy compared to a 3-point fit, supporting that the facial features are correctly identified.

      We use the term 'semi-rigid' to describe the six-point constellation because their relative spatial configurations remain stable across individuals and expressions, imposing a biological constraint on the model. This prevents unphysical warping of the face frame during 3D reconstruction and ensures the gaze cone remains anchored to the animal's true midline.
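
      The polarity resolution described above can be sketched as follows: the unit normal of the eye-eye-blaze plane is computed with a cross product, and its sign is then chosen so that it points away from the ear tufts. This is a minimal illustration, not our production code; the function name, the use of the tuft midpoint as the posterior anchor and the example coordinates are assumptions made for this sketch.

```python
import math
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def sub(a: Sequence[float], b: Sequence[float]) -> Vec3:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cross(a: Sequence[float], b: Sequence[float]) -> Vec3:
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def face_normal(left_eye: Vec3, right_eye: Vec3, blaze: Vec3,
                left_tuft: Vec3, right_tuft: Vec3) -> Vec3:
    """Unit normal of the eye-eye-blaze plane, oriented away from the ear tufts."""
    n = cross(sub(right_eye, left_eye), sub(blaze, left_eye))
    length = math.sqrt(dot(n, n))
    if length == 0:
        raise ValueError("eyes and blaze are collinear; the plane is undefined")
    n = (n[0] / length, n[1] / length, n[2] / length)

    # Assumed anchor: the tuft midpoint lies behind the face, so the vector from
    # that midpoint to the centroid of the three face points points "forward".
    face_center = tuple((left_eye[i] + right_eye[i] + blaze[i]) / 3.0 for i in range(3))
    tuft_mid = tuple((left_tuft[i] + right_tuft[i]) / 2.0 for i in range(3))
    if dot(n, sub(face_center, tuft_mid)) < 0:
        n = (-n[0], -n[1], -n[2])  # flip a backward-pointing normal
    return n


# Hypothetical landmark coordinates (arbitrary units, NOT measured data).
print(face_normal((-1.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0),
                  (-2.0, 0.5, -1.0), (2.0, 0.5, -1.0)))
```

      With this convention the result does not depend on the order in which the two eyes are supplied, because any backward-pointing normal is flipped by the tuft check.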

      (2) Results:

      a) Lines 203-205: What is the distinction between gaze orientation (defined by facial plane, 3D vector) and gaze direction (defined by ear tufts) ... is gaze direction in the 2D x-y plane? Why are two measures needed or different? It does not appear gaze orientation is used further in the manuscript and perhaps could be omitted.

      We appreciate the reviewer’s comment regarding the terminology. We have replaced all instances of ‘gaze orientation’ with ‘gaze direction’ to ensure consistency throughout the manuscript.

      To clarify, both terms referred to the same 3D unit vector. The ear tufts were not used to define a separate 2D measure; rather, they served as posterior anatomical anchors to resolve the 3D polarity of the normal vector (ensuring the vector points 'forward' from the face rather than 'backward'). Gaze direction was calculated in 3D space and was not restricted to a 2D x-y plane. We have clarified this in the revised Methods section (Lines 203–205) to avoid further ambiguity.

      b) Line 215-216: why is head-gaze velocity put in normalized units instead of degrees visual angle per second? How was the normalization performed (lines 549-557)? It would be simpler to see velocity as an angular speed (degrees angle per second) rather than a change in norms.

      We thank the reviewer for this suggestion. We agree that the expression is misleading.

      (1) We have replaced "face norm" with "face normal vector" (N) throughout the manuscript to clarify that we are referring to the 3D unit vector perpendicular to the facial plane.

      (2) Lines 224-225 and the corresponding Methods section (Lines 599-609) have been updated to reflect this change in units and terminology.

      We chose to use the change in the face normal vector in normalized units for our primary calculations because it allows for efficient spatiotemporal smoothing and is computationally robust at the very low thresholds required for our stability analysis. However, to address the reviewer's concern regarding interpretability, we have verified that our threshold of 0.05 normalized units corresponds to an angular velocity of 2.87 degrees per frame (frame duration 33 ms). Because we are operating at very small angular changes, the Euclidean distance between unit vectors is a near-linear proxy for the angular displacement in radians.
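
      The correspondence between normalized units and degrees can be checked with the chord-length identity for unit vectors, |u - v| = 2 sin(θ/2). The sketch below is a minimal illustration; the function name is hypothetical, and the 33 ms frame duration is the value quoted above.

```python
import math


def chord_to_angle_deg(chord: float) -> float:
    """Angle (degrees) between two unit vectors separated by Euclidean distance `chord`."""
    if not 0.0 <= chord <= 2.0:
        raise ValueError("the distance between unit vectors must lie in [0, 2]")
    return math.degrees(2.0 * math.asin(chord / 2.0))


threshold = 0.05   # stability threshold in normalized units per frame
frame_s = 0.033    # frame duration in seconds, as quoted above

exact = chord_to_angle_deg(threshold)    # exact conversion
small_angle = math.degrees(threshold)    # small-angle approximation: chord ~ angle in radians

print(f"exact: {exact:.4f} deg/frame")
print(f"small-angle approximation: {small_angle:.4f} deg/frame")
print(f"equivalent angular speed: {exact / frame_s:.1f} deg/s")
```

      At this threshold the exact and small-angle values differ by well under 0.01°, and the equivalent angular speed is roughly 87 degrees per second.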

      c) Lines 215-216: How do raw gaze traces appear over time ... are there gaze saccades and then stable fixations, or does it vary continuously? A plot of the gaze trace might be useful besides just showing velocity with a threshold, to evaluate to what extent stable fixation vs shifts are distinct.

      Author response image 2.

      Time course of gaze angular velocity and stability thresholding. The plot illustrates the temporal dynamics of the face normal vector velocity used to define stable gaze states. The blue trace represents the raw gaze velocity calculated in normalised units. The red dashed line denotes the empirical cutoff threshold of 0.05 units per frame.

      To clarify the temporal dynamics of marmoset head movements, we have provided a representative time course of head gaze velocity as shown in Author response image 2. The data clearly show a "saccade-and-fixate" pattern: large, distinct spikes in velocity (representing rapid head redirections) are separated by periods of relative stability.

      While minor high-frequency fluctuations in the raw trace (blue) may be attributed to facial feature detection noise, they remain significantly below our stability threshold (red dashed line). By applying this threshold, we successfully isolated biologically relevant "stable fixations" from "head saccades," ensuring that our subsequent social gaze analysis is based on periods of intentional head gaze direction.

      d) Lines 237-286: The writing in this section does not emphasize the main results. There seem to be three takeaway points that could be emphasized better in the topic sentences of each of the paragraphs: i) Marmosets tended to spend most of their time on either end of the elongated box, not in the middle, ii) Males spent more time near the front of the box near the other animal than females, iii) Familiar pairs spent more time closer to each other.

      To address this comment, we have reorganized this section to lead with the three key behavioral findings:

      (1) We now state clearly in the topic sentence that marmosets preferred the ends of the arena over the middle.

      (2) We have highlighted the finding that males spend significantly more time near the inner edge (closer to the partner) than females, irrespective of familiarity.

      (3) We emphasized that familiar pairs maintain closer and more dynamic social distances over time, whereas unfamiliar pairs tend to move further apart as a session progresses.

      e) Line 303: It would be useful to see time traces of head velocity of each member of the pair and categorization over time of the gaze event types. A stable epoch must be brief on the order of 100-200ms. It is unclear how distinct the stable fixation epochs are from the moments when the gaze is shifting. Also, the state transition analysis treats each stable epoch like one event, and then following a gaze movement by either of the pair, the state is defined again, is that correct?

      We defined stable epochs as continuous periods where the face normal vector velocity remained below 0.05 normalized units for both animals. This ensures that a "gaze state" is only categorized when both marmosets have relatively fixed head orientations. As shown in the time traces provided in Author response image 2, the velocity profile is characterized by sharp peaks (head saccades) and clearly defined troughs (fixations). Further, we generated a probability histogram of stable head-gaze epoch durations (Author response image 3). The median duration of these stable epochs is 200 ms, which aligns with biological expectations for fixation durations in primates and confirms that these states are distinct from the high-velocity shifts.

      The reviewer’s interpretation is correct. Our Markov chain model treats each stable epoch as a single event. A transition occurs when at least one animal moves (exceeding the velocity threshold), resulting in a new stable epoch where the relative gaze state is re-evaluated. This approach allows us to model the sequence of social interactions as a series of discrete behavioral decisions.

      Author response image 3.

      Temporal characteristics of stable head-gaze epochs. The histogram illustrates the probability distribution of the duration (ms) of stable gaze behaviour epochs. A minimum duration threshold of 100 ms was applied to exclude transient, non-purposeful head gazes.
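
      The epoch definition above (both animals below the velocity threshold, with a minimum duration) can be illustrated with a minimal sketch. The function name, the frame rate, and the array layout are assumptions made for illustration; only the 0.05 threshold and the 100 ms minimum come from the response, and this is not the authors' code.

```python
import numpy as np

def stable_epochs(vel_a, vel_b, fps, vel_thresh=0.05, min_dur_ms=100.0):
    """Return (start, end) frame indices (end exclusive) of stable epochs.

    vel_a, vel_b: per-frame face-normal velocities of the two animals
    (hypothetical inputs, already in normalized units).
    """
    vel_a = np.asarray(vel_a, dtype=float)
    vel_b = np.asarray(vel_b, dtype=float)
    # A frame is stable only if BOTH animals are below the threshold.
    stable = (vel_a < vel_thresh) & (vel_b < vel_thresh)
    # Pad with False so every run has a rising and a falling edge.
    padded = np.concatenate(([False], stable, [False]))
    edges = np.diff(padded.astype(int))
    starts = np.flatnonzero(edges == 1)
    ends = np.flatnonzero(edges == -1)
    # Convert the minimum duration from milliseconds to frames.
    min_frames = int(np.ceil(min_dur_ms * fps / 1000.0))
    return [(int(s), int(e)) for s, e in zip(starts, ends) if e - s >= min_frames]
```

      Each returned interval would then be assigned a single gaze state, which is what allows the sequence of epochs to be treated as a chain of discrete events.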

      f) Lines 316-326: Some general summarizing statements to lead this paragraph would be useful. It seems that familiar pairs are more likely to participate in joint gaze, especially when close to each other, and perhaps, that males tended to gaze at females more than the reverse. Is there any notion that males were following the gaze of females?

      We thank the reviewer for these suggestions. We have revised the topic sentences of this section to lead with a summary of the social takeaways, specifically highlighting the higher level of male interest and the shift toward reciprocal coordination in familiar pairs.

      The reviewer correctly identified an important dynamic. Our transition analysis (Fig. 4D) confirms that males in both familiar and unfamiliar dyads frequently follow the female's gaze. This is evidenced by a robust transition probability (~17%) from "Male-to-Female Partner Gaze" (blue node) to "Joint Gaze" (green node). We found that this gaze-following behavior was a general feature of the dyads and did not differ significantly by familiarity, which is why it was not previously emphasized. However, we have now added a statement to the Results (Lines 358-365) to explicitly describe this male-led gaze-following behavior.
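
      As an illustration of how a transition probability of this kind can be estimated from a sequence of stable-epoch states, the sketch below counts consecutive state pairs and normalizes per source state. The state labels and the function name are hypothetical, and this is a generic first-order Markov estimate rather than the authors' pipeline.

```python
from collections import Counter, defaultdict

def transition_probabilities(states):
    """Estimate first-order transition probabilities from a state sequence."""
    counts = defaultdict(Counter)
    # Count every consecutive (current, next) pair of stable epochs.
    for cur, nxt in zip(states, states[1:]):
        counts[cur][nxt] += 1
    # Normalize each row so outgoing probabilities sum to 1.
    return {
        cur: {nxt: n / sum(row.values()) for nxt, n in row.items()}
        for cur, row in counts.items()
    }
```

      In this formulation, a gaze-following event corresponds to the entry from a partner-gaze state to the joint-gaze state, and a self-transition corresponds to a recurrence of the same state.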

      g) Lines 328-337: Can these findings in this paragraph be summarized more generally? It seems males view unfamiliar females longer, whereas for familiar females they are more likely to reciprocate viewing if being viewed by them and then to join in joint gaze with them. Would that event, viewing a female and then a transition to joint gaze, not be categorized as a gaze-following event?

      We have now summarized the paragraph to emphasize the transition from vigilant monitoring in unfamiliar pairs to reciprocal awareness in familiar pairs.

      Regarding "longer" viewing: We have clarified the text to specify that males' interest in unfamiliar females is persistent and robust rather than simply "longer" in a single duration. The high recurrence probability signifies that males consistently re-orient their gaze back to the unfamiliar female even if the interaction is briefly interrupted by movement.

      Regarding gaze following and joint gaze: The reviewer asks if the transition from viewing a female to joint gaze constitutes gaze following. We agree that a transition from "male-to-female gaze" to "joint gaze" is indeed a gaze-following event (as noted in our previous response regarding Fig. 4D). However, the specific transition discussed in this paragraph (female-to-male gaze to male-to-female gaze) is different: it describes a "reciprocal" event where the male responded to being looked at by looking back at the female, while the female simultaneously shifted her gaze away. Since the two gaze cones did not intersect on an external object or on each other's faces simultaneously at the end of this transition, it was not categorized as joint gaze or gaze following.

      h) Lines 339-351: It is not clear why gazing at the region surrounding a female's face (as opposed to the face itself) reflects "gaze monitoring tied to increased social attention" (Dal Monte et al., 2022). This hypothesis could be expanded to make the prediction clear in this paragraph.

      We thank the reviewer for identifying the need to clarify the hypothesis regarding the region surrounding the face. We have expanded this paragraph to explain why gazing at the peripheral facial region reflects social monitoring.

      In many primate species, direct and sustained eye contact is often interpreted as a threat or a challenge, particularly between unfamiliar individuals. Peripheral monitoring (looking at the area immediately surrounding the face) can allow an animal to stay highly attentive to the partner's head orientation, gaze direction, and facial expressions (all critical for anticipating future actions) while minimizing the risk of social conflict. By demonstrating that unfamiliar marmosets use this peripheral strategy significantly more than familiar ones, we provide evidence that social attention in novel dyads is characterized by a monitoring strategy that balances the need for information with social caution.

      i) Lines 354-373: This section seems to suggest again that in a familiar male/female pair, the male is more likely to follow the female gaze and establish a joint gaze, and this occurs less with the unfamiliar pair only when closer in distance. Some summary sentences to begin the paragraph could help frame what to expect from the results.

      We have added summarizing topic sentences to this section to clarify the relationship between familiarity and the spatial distribution of joint gaze.

      (3) Discussion:

      Lines 380-463: This section reads more clearly than most of the results, where it is often hard to connect the data plots to their significance for behavior. Overall, I believe the manuscript could be improved by setting up a hypothesis before presenting results in the paragraphs demonstrating the data. Some of the main findings appear in text from lines 413-419 (somewhat hidden even in discussion).

      We sincerely appreciate the reviewer’s positive feedback on the clarity of the latter sections of our Discussion. We have taken the suggestion to heart and have performed a comprehensive restructuring of the Results and Discussion sections.

      (1) We have moved the key takeaways, specifically the distinction between vigilant monitoring in unfamiliar pairs and reciprocal coordination in familiar pairs, from the end of the Discussion to the topic sentences of the relevant Results paragraphs.

      (2) We established a unified framework throughout the manuscript that connects pixel-level tracking stability to the biological "saccade-and-fixate" movement pattern, and ultimately to the social dimensions of sex and familiarity.

      (4) A couple of additional questions to address in the discussion:

      a) Can you speculate why in this behavioral context the marmosets do not engage in reciprocal gaze where both are simultaneously looking at each other (lines 297-301)? How low is the incidence of this event, numerically, in comparison to the other events (1 in 1000 events, etc)?

      We appreciate the reviewer’s interest in the lack of reciprocal gaze (mutual eye contact).

      Numerically, reciprocal gaze events occurred with a frequency of approximately 1 in 500 social gaze events (comprising less than 0.2% of our social dataset). Given this extreme scarcity, we felt that any statistical comparisons across sex or familiarity would be underpowered and potentially misleading, leading to our decision to focus on partner and joint gaze states.

      We speculate that the rarity of reciprocal gaze is primarily due to our task-free experimental setup. Unlike directed cooperation tasks where animals must look at each other to coordinate actions for a reward (e.g., Miss & Burkart, 2018), our study focused on task-free interactions. In a free-moving context without a common goal, marmosets may prioritize monitoring the environment or the partner’s actions (joint or partner gaze) over direct, sustained mutual eye contact, which can sometimes be perceived as a confrontational or high-arousal signal in primate social hierarchies.

      b) Does a transition from a marmoset viewing their partner, to a joint gaze, count as a gaze-following event? It appears the authors are reluctant to use that terminology. What are the potential concerns in that terminology? Is there a concern that both animals orient to the same object that is salient to them without it being due to their gaze?

      A transition from a partner-directed gaze to a joint gaze is indeed a gaze-following event. We distinguish these events from a transition between partner-directed gazes (e.g., male-to-female to female-to-male). In these "reciprocation" cases, once the second animal looked at the first, the first animal shifted their gaze away. Because the two gaze cones did not intersect on a common object at the end of the transition, we classified such events as a social exchange of attention rather than a coordinated gaze-following event.

      Reviewer #2 (Recommendations for the authors):

      I do have a few questions/points for clarification:

      (1) While your approach appears to be able to track head orientation when the face is occluded or turned away from the primary cameras, how was the accuracy of this validated? Since you have multiple cameras, it should be possible to make the estimate using the occluded cameras and then validate using the non-occluded ones.

      We appreciate the reviewer's comment regarding the validation of our tracking during partial occlusions.

      We wish to clarify that our system does not utilize "primary" vs "auxiliary" cameras. Rather, any two or more cameras that capture facial features with high confidence are used to triangulate the points into 3D space. Thus, the "primary" cameras are dynamically determined frame-by-frame based on the animal's orientation.

      To validate the accuracy of our 3D reconstruction during occlusions, we utilized a "projection-validation" approach. As demonstrated in Figure 2B (left panel), when the face is turned away from a specific camera, leaving only the back of the head visible, we used the facial features triangulated from the other non-occluded cameras and projected them onto the image plane of the occluded camera. The fact that these projected points aligned precisely with the expected (but hidden) anatomical landmarks confirms the global accuracy of our 3D model.

      We previously benchmarked this approach using a three-camera system where we triangulated coordinates via two cameras and successfully projected them onto the third camera's image plane with high accuracy. This ensures that even when a camera is "blind" to the face, the 3D position estimated by the rest of the array remains robust.
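
      The projection-validation idea can be sketched as a reprojection check: landmarks triangulated from the unoccluded cameras are projected through the held-out camera's projection matrix and compared with reference pixel positions. The sketch assumes an ideal pinhole model with a known 3x4 projection matrix and no lens distortion; the function names are hypothetical and the authors' calibration pipeline may differ.

```python
import numpy as np

def project_points(P, points_3d):
    """Project Nx3 world points to Nx2 pixels with a 3x4 pinhole matrix P."""
    pts = np.asarray(points_3d, dtype=float)
    homog = np.hstack([pts, np.ones((pts.shape[0], 1))])  # Nx4 homogeneous
    uvw = homog @ np.asarray(P, dtype=float).T            # Nx3
    return uvw[:, :2] / uvw[:, 2:3]                       # divide by depth

def reprojection_error(P, points_3d, reference_2d):
    """Per-landmark pixel distance between projected and reference points."""
    projected = project_points(P, points_3d)
    return np.linalg.norm(projected - np.asarray(reference_2d, dtype=float), axis=1)
```

      A small per-landmark error on the held-out view would support the claim that the 3D estimate from the remaining cameras is accurate; what counts as "small" depends on image resolution and is not specified here.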

      (2) Marmosets, like other non-human primates, also look at other body postures for their social communication, though admittedly marmosets are far more likely to look others in the face than larger primates. The tail-raised genital displays come to mind. While the paper primarily focuses on shared vs deviant gaze, and I believe tracks not only the angle of viewing towards the target but also the distance from the face (please clarify if I am wrong), it would also be useful to know how often marmosets are looking at each other beyond just the face. This is particularly interesting if the gaze towards the partner varies depending on whether that partner was generally oriented towards the gazer, or not. For the joint gaze, were there conditions in which the two were looking at the same target, but had body postures that were not oriented toward one another (i.e. looking at a distant target beyond one of the animals, like looking over someone else's shoulder)?

      We thank the reviewer for highlighting the importance of body postures and non-facial social signals (e.g., genital displays) in marmoset communication.

      At the inception of this project, we explored tracking multiple body parts. However, due to the marmoset's dense fur and the lack of distinct skeletal markers under naturalistic lighting, human annotators and early automated tools struggled to achieve the precision required for high-resolution 3D kinematics. While recent advances in whole-body tracking now make these questions approachable, we chose to focus on the face normal vector because it provided the most robust and high-confidence signal for social orientation in our current dataset.

      Regarding the "looking over the shoulder" scenario, we utilized a hierarchical classification system to prevent wrong categorization. Intersection with the partner’s face always took priority. If one animal’s gaze cone contained the other’s face, the state was classified as "Partner Gaze", even if the two gaze cones happened to intersect at a distant point in space. This ensures that "Joint Gaze" specifically captures instances where both animals ignore one another’s face regions to focus on a shared external target.
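
      The priority rule described above can be written as a short decision cascade. In this sketch the cone test, the state labels, and the idea of passing in a precomputed cone-intersection flag are assumptions made for illustration; the cone half-angle is left as a free parameter, and the code is not the authors' classifier.

```python
import numpy as np

def in_cone(origin, direction, half_angle_deg, point):
    """True if `point` lies within a cone of the given half-angle."""
    v = np.asarray(point, dtype=float) - np.asarray(origin, dtype=float)
    d = np.asarray(direction, dtype=float)
    dist = np.linalg.norm(v)
    if dist == 0:
        return True  # the point coincides with the cone apex
    cos_angle = np.dot(v, d) / (dist * np.linalg.norm(d))
    return bool(cos_angle >= np.cos(np.radians(half_angle_deg)))

def classify_gaze(a_sees_b_face, b_sees_a_face, cones_intersect):
    """Hierarchical labels: face intersection takes priority over joint gaze."""
    if a_sees_b_face and b_sees_a_face:
        return "reciprocal gaze"
    if a_sees_b_face or b_sees_a_face:
        return "partner gaze"
    if cones_intersect:
        return "joint gaze"  # reached only when neither cone contains a face
    return "no social gaze"
```

      Because the face checks come first, a frame in which one animal looks past its partner's face at a distant point is labeled as partner gaze even if the two cones also cross elsewhere.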

      We agree that the relationship between body posture and head gaze is a fascinating area for future research. In our current setup, while "Joint Gaze" requires the head-gaze cones to intersect, the animals' bodies could indeed be oriented in different directions (e.g., looking at a distant target behind the partner). We have added a note to the Discussion acknowledging that incorporating whole-body gestures would further deepen the understanding of marmoset social ethology.

      (3) In the introduction, (line 70), you raise the question of ecological relevance, using rhesus in laboratory settings. This could use a little more expansion/explanation of the limitations of current/past approaches.

      We thank the reviewer for the suggestion to expand upon the ecological limitations of traditional laboratory paradigms.

      We have substantially revised the Introduction (Lines 70–82) to provide a more detailed critique of past approaches. Specifically, we now highlight how traditional head-fixed or screen-based paradigms decouple eye movements from natural head-body dynamics and lack the reciprocal, multi-agent complexity found in real-world social environments (e.g., Land, 2006; Shepherd, 2010). By contrasting these constraints with the spatially and socially embedded nature of marmoset interactions, we clarify why a more naturalistic, quantitative approach is necessary to understand the true dynamics of social gaze. These additions provide a stronger theoretical foundation for our move toward a free-moving experimental model.

    1. eLife Assessment

      This important work examines the effects of side-wall confinement on chemotaxis of swimming bacteria in a shallow microfluidic channel. The authors present convincing experimental evidence, combined with geometric analysis and numerical simulations of simplified models, showing that chemotaxis is enhanced when the distance between the side walls is comparable to the intrinsic radius of chiral circular swimming near open surfaces. This study should be of interest to scientists specializing in bacteria-surface interactions.

    2. Reviewer #1 (Public review):

      The authors show experimentally that, in 2D, bacteria swim up a chemotactic gradient much more effectively when they are in the presence of lateral walls. Systematic experiments identify an optimum for chemotaxis for a channel width of ~8µm, a value close to the average radius of the circle trajectories of the unconfined bacteria in 2D. These chiral circles impose that the bacteria swim preferentially along the right-side wall, which indeed yields chemotaxis in the presence of a chemotactic gradient. These observations are backed by numerical simulations and a geometrical analysis.

    3. Reviewer #3 (Public review):

      This paper addresses, through experiment and simulation, the combined effects of bacterial circular swimming near no-slip surfaces and chemotaxis in simple linear gradients. The authors have constructed a microfluidic device in which a gradient of L-aspartate is established, to which bacteria respond while swimming while confined in channels of different widths. There is a clear effect that the chemotactic drift velocity reaches a maximum in channel widths of about 8 microns, similar in size to the circular orbits that would prevail in the absence of side walls. Numerical studies of simplified models confirm this connection.

      The experimental aspects of this study are well executed. The design of the microfluidic system is clever in that it allows a kind of "multiplexing" in which all the different channel widths are available to a given sample of bacteria.

      The authors have included a useful intuitive explanation of their results via a geometric model of the trajectories. In future work it would be interesting to analyze further the voluminous data on the trajectories of cells by formulating the mathematical problem in terms of a suitable Fokker-Planck equation for the probability distribution of swimming directions. In particular, this might help understand how incipient circular trajectories are interrupted by collisions with the walls and how this relates to enhanced chemotaxis.

      The authors argue that these findings may have relevance to a number of physiological and ecological contexts. As these would be characterized by significant heterogeneity in pore sizes and geometries, further work will be necessary to translate the present results to those situations.

    4. Author response:

      The following is the authors’ response to the previous reviews

      Public Reviews:

      Reviewer #1 (Public review):

      The authors show experimentally that, in 2D, bacteria swim up a chemotactic gradient much more effectively when they are in the presence of lateral walls. Systematic experiments identify an optimum for chemotaxis for a channel width of ~8µm, a value close to the average radius of the circle trajectories of the unconfined bacteria in 2D. These chiral circles impose that the bacteria swim preferentially along the right-side wall, which indeed yields chemotaxis in the presence of a chemotactic gradient. These observations are backed by numerical simulations and a geometrical analysis.

      Reviewer #3 (Public review):

      This paper addresses, through experiment and simulation, the combined effects of bacterial circular swimming near no-slip surfaces and chemotaxis in simple linear gradients. The authors have constructed a microfluidic device in which a gradient of L-aspartate is established, to which bacteria respond while swimming while confined in channels of different widths. There is a clear effect that the chemotactic drift velocity reaches a maximum in channel widths of about 8 microns, similar in size to the circular orbits that would prevail in the absence of side walls. Numerical studies of simplified models confirm this connection.

      The experimental aspects of this study are well executed. The design of the microfluidic system is clever in that it allows a kind of "multiplexing" in which all the different channel widths are available to a given sample of bacteria.

      The authors have included a useful intuitive explanation of their results via a geometric model of the trajectories. In future work it would be interesting to analyze further the voluminous data on the trajectories of cells by formulating the mathematical problem in terms of a suitable Fokker-Planck equation for the probability distribution of swimming directions. In particular, this might help understand how incipient circular trajectories are interrupted by collisions with the walls and how this relates to enhanced chemotaxis.

      The authors argue that these findings may have relevance to a number of physiological and ecological contexts. As these would be characterized by significant heterogeneity in pore sizes and geometries, further work will be necessary to translate the present results to those situations.

      Thanks to the referees' input and further work, we think our revised manuscript now meets the high standard of eLife.

      Recommendations for the authors:

      The importance of the circular swimming chirality for the observed phenomenon could be further emphasized by actually using the word "chiral" or "chirality" in the text. Also, indicating what would change if swimming were counterclockwise rather than clockwise would help the reader understand the key significance of chirality.

      We thank the reviewer for this insightful suggestion. We agree that the chirality of the surface interaction is central to the observed phenomenon and should be explicitly highlighted to improve the reader's understanding.

      In response, we have incorporated the terms "chiral" and "chirality" throughout the manuscript (Abstract, Introduction, Results, and Discussion) to emphasize this aspect. Furthermore, we have added a specific explanation in the Results section (the last paragraph of subsection “The cells in the right sidewall region dominated the chemotaxis of E. coli with lane confinements”) detailing the hypothetical scenario of counter-clockwise swimming. We clarify that in such a case, the hydrodynamic interaction would cause cells to veer left, resulting in up-gradient accumulation along the left sidewall rather than the right. We believe these additions significantly improve the clarity of the underlying physical mechanism.

      Reviewer #1 (Recommendations for the authors):

      I still have several comments that the authors may want to consider for the last version.

      - The run and tumble behavior of the cells at the surface remains puzzling and would need some more explanation in the text. Tumbles with no significant reorientation angle amount largely to smooth swimmers. How can a model based on run-and-tumbles be used to explain the difference between LSW and RSW?

      We apologize for the lack of clarity regarding the surface run-and-tumble behavior. While it is true that surface tumbles often result in smaller reorientation angles compared to bulk swimming, they are not negligible and play a critical role in the observed asymmetry. As shown in the tumble angle distributions (Fig. 2E and 2F), the probability of a tumble angle exceeding π/2 is approximately 9% for sidewall trajectories and 30% for the middle area. This tumbling behavior leads to differences between the left sidewall (LSW) and right sidewall (RSW) in two key ways:

      First, as detailed in our geometric analysis (Fig. 6), running cells following stable clockwise circular paths are geometrically favored to reach the RSW. Because cells moving up-gradient (towards the RSW) experience suppressed tumbling, they maintain these stable circular trajectories and accumulate effectively. Conversely, cells moving down-gradient (towards the LSW) experience enhanced tumbling. These frequent interruptions distort the circular trajectories required to reach the LSW, resulting in fewer bacteria entering the LSW compared to the RSW.

      Second, once at the wall, the difference in tumbling frequency dictates retention. The majority of LSW cells are swimming down-gradient (LSW-DG) and thus tumble more frequently, increasing their probability of escaping the wall. The majority of RSW cells are swimming up-gradient (RSW-UG), which suppresses tumbles and increases their residence time at the wall.

      The relevant clarifications have been included in the last paragraph of “Results” in the manuscript.

      - Figure 5B would need more explanation. I still don't understand the different behaviors for the right and left side walls at small widths. Is it noise really or a more complex behavior? Since most of these calculations are based precisely on the shape of these curves it would be useful to discuss them in more detail.

      We apologize for the lack of clarity. The behavior observed at small widths in Figure 5B is not noise; rather, it reflects the idealized nature of our simulation model.

      In the simulation, bacteria were modeled as active particles without explicit steric exclusion for the flagella and cell body. Consequently, simulated cells retain the ability to reorient and turn freely even in very narrow lanes (w ≤ 6 μm), allowing the geometric sorting mechanism (which favors the RSW) to function efficiently even at small widths. This is why the simulation shows a distinct difference between LSW and RSW proportions in this regime.

      In the experimental reality, however, the finite size of the bacterial body and flagella creates steric hindrance. In narrow channels, this physical constraint restricts the cells' ability to turn, thereby disrupting the circular swimming mechanism required to sort cells into the RSW. As a result, experimental data shows that the proportions of LSW and RSW cells tend to equalize in narrow channels (e.g., w = 6 μm in Fig. 4B), leading to a lower chemotactic drift velocity than predicted by the simulation.

      We have added a discussion regarding these steric effects and the deviation at narrow widths to the Results section (the penultimate paragraph of subsection "Simulation of E. coli chemotaxis within lane confinement") in the revised manuscript.
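
      To make the "active particle without steric exclusion" idea concrete, the sketch below simulates a point-like swimmer that follows a clockwise circular path and tumbles at a rate that depends on its heading relative to the gradient. Everything here is an illustrative assumption rather than the authors' model: the speed, the base tumble rate, the bias form, the choice of +x as the up-gradient direction, and the absence of walls. Only the clockwise chirality, the roughly 8 µm orbit radius, and the suppression of tumbling when swimming up-gradient are taken from the text.

```python
import math
import random

def simulate(n_steps, dt, speed=20.0, radius=8.0,
             base_tumble_rate=1.0, bias=0.5, seed=0):
    """Chiral run-and-tumble point particle; returns a list of (x, y)."""
    rng = random.Random(seed)
    omega = speed / radius          # angular speed of the circular orbit
    x, y, theta = 0.0, 0.0, 0.0     # heading 0 means swimming up-gradient (+x)
    path = [(x, y)]
    for _ in range(n_steps):
        # Tumbling is suppressed when heading up-gradient (cos(theta) > 0)
        # and enhanced when heading down-gradient.
        rate = base_tumble_rate * (1.0 - bias * math.cos(theta))
        if rng.random() < rate * dt:
            theta = rng.uniform(-math.pi, math.pi)  # tumble: random new heading
        x += speed * dt * math.cos(theta)
        y += speed * dt * math.sin(theta)
        theta -= omega * dt         # clockwise rotation
        path.append((x, y))
    return path
```

      Since the particle has no body or flagella, nothing in this sketch prevents it from turning in an arbitrarily narrow lane, which is the idealization the response identifies as the source of the simulation-experiment discrepancy at small widths.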

      - The importance of the chirality of the circular trajectories, although essential, remains insufficiently mentioned in the text.

      We have incorporated the terms "chiral" and "chirality" throughout the manuscript (Abstract, Introduction, Results, and Discussion) to emphasize this aspect. Furthermore, we have added a specific explanation in the Results section (the last paragraph of subsection “The cells in the right sidewall region dominated the chemotaxis of E. coli with lane confinements”) detailing the hypothetical scenario of counter-clockwise swimming.

      - It would be useful to color-code the trajectories of Figure 1B and alike with time.

      Thank you for the suggestion. Now the trajectories in Fig. 1B have been redrawn. Distinct colors denote individual trajectories, with color intensity darkening to indicate time progression.

    1. eLife Assessment

      This interesting study presents a multi-OMICs approach to unify different lines of evidence regarding the epigenetic regulation of the key virulence factor causing placental malaria during P. falciparum infection. Most results are confirmatory of previous observations; nonetheless, the claims are supported by convincing evidence. The combinatorial approach chosen here is unprecedented and therefore provides valuable new data. In addition, the comparative investigation of different DNA methylation modifications is novel and disproves a direct role in var gene regulation.

    2. Reviewer #2 (Public review):

      Summary:

      Dr Lenz and colleagues report on their in vitro studies comparing gene transcription and epigenetic modifications in Plasmodium falciparum NF54 parasites selected or not selected for adhesion of the infected erythrocytes (IEs) to the placental IE adhesion receptor chondroitin sulfate A (CSA).

      The authors report that selection led to preferential transcription of var2csa, the gene that encodes the VAR2CSA-type PfEMP1 well-established as the PfEMP1 mediating IE adhesion to CSA. They confirm that transcriptional activation of var2csa is associated with distinct depletion of H3K9me3 marks and that transcriptional activation is linked to repositioning of var2csa. Finally, they provide preliminary evidence potentially implicating 5mC in transcriptional regulation of var2csa.

      Strengths:

      The study confirms previously reported features of gene transcription and epigenetic modifications in Plasmodium falciparum.

      Weaknesses:

      No major new finding is reported.

      Comments on revisions:

      I suggest replacing the term "pregnancy-associated malaria (PAM)" with the more current and more precise term "placental malaria (PM)" throughout the manuscript.

      L. 59-60: "... shielding of the parasite antigens expressed on pRBC surfaces by leukocytes...". It is unclear to me what this means - I suggest a rephrasing for improved clarity.

      L. 144-6: Please provide a reference for the primary antibody reagent used.

    3. Reviewer #3 (Public review):

      Summary:

      The manuscript by Lenz et al. seeks to investigate molecular mechanisms directing virulence gene expression in the malaria parasite Plasmodium falciparum. The report provides a detailed characterization of the phenotypic and epigenetic features of a var2csa expressing parasite population, the key virulence gene causing the clinical syndrome of placental malaria. Novel evidence supporting the concept that active expression of this gene is associated with nuclear repositioning away from suppressive regions of chromatin is presented. In addition, the authors conducted a preliminary characterization of different forms of DNA methylation, suggesting that 5-methylcytosine is enriched in virulence genes, but does not correlate with their activation or repression. However, a trend towards higher enrichment of 5-methylcytosine in highly active as opposed to inactive genes from the core genome was reported, although this observation requires further validation.

      Strengths:

      The concise study provides a well documented and controlled set of experiments utilizing state-of-the-art OMICs methodologies including ChIPseq, RNAseq, chromatin-conformation capture (Hi-C) and DNA methylation (MeDIPseq) to generate deep insight into the epigenetic regulation of the key virulence factor of P. falciparum. The study unifies different lines of evidence and thereby contributes to a clearer understanding of the mechanisms underlying active expression of var2csa.

      Weaknesses:

      Although all experiments appear to have been rigorously conducted and documented with appropriate replicates and controls, the study is overall lacking statistical support from individual analyses of the biological replicates. In particular, the key novel result suggesting increased distance of the active var2csa gene from regions of heterochromatin as assessed by chromatin conformation capture would benefit from further analysis by comparison with other genetic loci. This also applies to the differential DNA methylation patterns, which should be dissected in more detail to support any association with gene expression or intron function.

    4. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      The manuscript by Lenz and colleagues describes a detailed examination of the epigenetic changes and alterations in subnuclear arrangement associated with the activation of a unique var gene associated with placental malaria in the human malaria parasite Plasmodium falciparum. The var gene family has been heavily studied over the last couple of decades due to its importance in the pathogenesis of malaria, its role in immune avoidance, and the unique transcriptional regulation that it displays. Aspects of how mutually exclusive expression is regulated have been described by several groups and are now known to include histone modifications, subnuclear chromosomal arrangement, and in the case of var2csa, regulation at the level of translation. Here the authors apply several methods to confirm previous observations and to consider a possible role for DNA methylation. They demonstrate that the histone mark H3K9me3 is found at the promoters of silent genes, var2csa moves away from other var gene clusters when activated, and while DNA methylation is detectable at var genes, it does not seem to correlate with transcriptional activation/silencing. Overall, the data and approach appear sound.

      Strengths:

      The authors employ the latest methods for epigenetic analysis of histone marks, transcriptomic analysis, DNA methylation, and chromosome conformation. They also use strong selection pressure to be able to examine the gene var2csa in its active and silent state. This is likely the only paper that has used all these methods in parallel to examine var gene regulation. Thus, the paper provides readers with confidence in the interpretation of independent methods that address a similar subject.

      We thank the reviewer for this positive assessment. We appreciate the recognition that our study combines complementary approaches including histone mark profiling, transcriptomic analysis, DNA methylation mapping, and chromosome conformation capture in parallel to the use of strong population selection that enables a controlled comparison of var2csa in active versus silent states. We agree that the convergence of independent methods strengthens confidence in the interpretation.

      Weaknesses:

      The primary weakness of the paper is that none of the conclusions are novel and the overall conclusions do not shed much new light on the topic of var gene regulation or antigenic variation in malaria parasites. The paper is largely confirmatory. The roles of H3K9me3 and subnuclear localization in var gene regulation are well established by many groups (including for var2csa), albeit in some cases using alternative methods. The only truly unique aspect of the manuscript is the description of 5mC at var2csa when the gene is transcriptionally active or silent. Here the authors demonstrate that the mark has no clear role in transcriptional activation or silencing; however, this will not be surprising to many in the field who have previously cast doubt on a regulatory role for this modification.

      While we agree that some individual features of var gene regulation, including H3K9me3 enrichment, have been described previously, our study integrates, for the first time, several layers of gene regulation at the clinically important var2csa locus using phenotypically homogeneous placental-binding parasite populations. As expected, var2csa activation coincided with a loss of H3K9me3 at the locus. However, using high-resolution chromatin conformation capture (to our knowledge, this experiment had never been applied to phenotypically homogeneous parasite populations), we quantified the repositioning of var2csa relative to heterochromatic telomeric clusters. We further assessed DNA methylation in this framework and show that 5-methylcytosine is broadly present at var genes and may correlate with transcript level, but is uncoupled from transcriptional activation, repression, and switching. Together, these findings integrate transcriptional state, chromatin marks, and 3D genome organization at var2csa and argue against models in which 5mC acts as a primary regulatory switch for var gene expression.

      Reviewer #2 (Public Review):

      Summary:

      Dr Lenz and colleagues report on their in vitro studies comparing gene transcription and epigenetic modifications in Plasmodium falciparum NF54 parasites selected or not selected for adhesion of the infected erythrocytes (IEs) to the placental IE adhesion receptor chondroitin sulfate A (CSA).

      The authors report that selection led to preferential transcription of var2csa, the gene that encodes the VAR2CSA-type PfEMP1 well-established as the PfEMP1 mediating IE adhesion to CSA. They confirm that transcriptional activation of var2csa is associated with distinct depletion of H3K9me3 marks and that transcriptional activation is linked to repositioning of var2csa. Finally, they provide preliminary evidence potentially implicating 5mC in the transcriptional regulation of var2csa.

      Strengths:

      The study confirms previously reported features of gene transcription and epigenetic modifications in Plasmodium falciparum.

      As stated in our response to Reviewer 1, our study combines, for the first time, complementary approaches, including transcriptomic analysis, histone mark profiling, DNA methylation mapping, and chromosome conformation capture, together with strong population selection to enable a controlled comparison of var2csa in active versus silent states.

      Weaknesses:

      No major new finding is reported. The strength of the evidence presented is mostly solid, although certain elements, e.g., the role of 5mC in transcriptional regulation of var2csa, appear preliminary and incomplete.

      While we agree that no major new finding is reported, we were able to use, for the first time, a high-resolution chromatin conformation capture method to quantify the repositioning of var2csa relative to heterochromatic telomeric clusters. We also show that 5-methylcytosine is present at var genes and may correlate with transcript level, but is uncoupled from transcriptional activation, repression, and switching. Together, these findings integrate, for the first time, transcriptional state, chromatin marks, and 3D genome organization at var2csa and argue against models in which 5mC acts as a primary regulatory switch for var gene expression.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the Authors):

      (1) In the second paragraph of the introduction, the authors state "....such as the shielding of the parasite antigens expressed on pRBC surfaces by other cells and the evasion of splenic clearance (8)." What does "other cells" mean here?

      We thank the reviewer for this comment. We have clarified the cell type in the text.

      (2) In their interpretation of the Hi-C data, the authors conclude that the var2csa expressing parasites display "tighter heterochromatin control of var gene regions" and "interactions around other silent var genes were increased" and "an overall compaction of telomere ends and var gene-containing intrachromosomal regions". While the data appear to show that this is true when they compare the two parasite populations, I am concerned that the authors might be misinterpreting the data. It is important to note that the NF54CSAh line is heavily selected to be nearly entirely homogeneous for var gene expression while the NF54 line is exceptionally heterogeneous. This is shown in Figure 1G. Thus, any chromosomal arrangement specific for var gene expression in the unselected NF54 population will be similarly heterogeneous and therefore could appear less tight. In other words, interactions around silent var genes and overall compaction of telomere ends might be identical between individual parasites within these populations, but appear tighter or more compact in the var2csa expressing line simply because it is a homogeneous population. Perhaps this is what the authors meant to convey, however as currently written, it seems that they conclude the expression of var2csa results in a unique change in chromosome organization. A better comparison would be two populations homogeneously expressing different var genes, one expressing var2csa and one expressing an alternative var gene. Such lines can be generated through clonal isolation or selection for binding to a different host receptor.

      We thank the reviewer for this comment. The reviewer is correct, and we have revised the Discussion section of the manuscript to clarify this issue.

      (3) The title of the last section of the Results is "Distribution of DNA methylation influences gene expression overall but does not mediate transcriptional activation and switching in antigenic variation". This is an overstatement. The authors show that DNA methylation is absent at var gene promoter regions and enriched in coding regions, but they provide no evidence that it "influences gene expression overall". This is speculation. Lastly, when the authors examined 5mC occupancy across genes, did they normalize for GC content of the DNA sequences? GC content is known to increase dramatically in coding regions (particularly in var genes) and thus could explain the distribution of this mark. If the authors corrected for this, they should directly state this in the results section. If they did not, they should explain why they don't think this property of the P. falciparum genome explains the distribution of 5mC.

      There is often a misconception in the field that DNA methylation is primarily confined to CpG islands in promoter regions and functions mainly as a repressor of transcription. However, in contrast to promoter methylation, methylation within gene bodies is generally associated with higher levels of gene expression, suggesting a role in facilitating transcription elongation. Gene-body methylation can also repress internal promoters, thereby preventing spurious transcription initiation within the gene. In addition, it has been shown to influence alternative splicing by affecting RNA polymerase II elongation kinetics.

      We propose that, in Plasmodium, DNA methylation may be associated with priming genes for transcriptional activity rather than repressing transcription. Specifically, higher methylation levels may facilitate recruitment of the RNA polymerase II transcriptional machinery to enable transcription. In Figure 4B, we observe higher levels of DNA methylation in the first exon of highly expressed genes in both the NF54 and NF54CSAh lines. Interestingly, we also detect high levels of methylation across most introns of the var genes, introns that must be transcribed, cannot be degraded, and are essential for var gene regulation, suggesting a possible sequence-recognition function. We have edited the manuscript to improve clarity.
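
      The GC-content question raised in point (3) can be made concrete with a small sketch. It is an illustration only, not the authors' pipeline: the sequences, the methylated-site counts, and the function names are hypothetical. It shows how a per-base-pair 5mC signal and a per-cytosine 5mC fraction can rank the same two regions in opposite order when one region is AT-rich.

```python
# Illustrative only: sequences and methylated-site counts are made up.

def methylation_per_bp(n_methylated: int, seq: str) -> float:
    """Methylated sites divided by region length (not GC-normalized)."""
    return n_methylated / len(seq)


def methylation_per_cytosine(n_methylated: int, seq: str) -> float:
    """Methylated sites divided by the number of cytosines on both strands."""
    s = seq.upper()
    # A G on the given strand marks a C on the opposite strand.
    n_cytosines = s.count("C") + s.count("G")
    if n_cytosines == 0:
        return 0.0
    return n_methylated / n_cytosines


# Hypothetical GC-rich exon window (13 cytosines over both strands, 4 methylated).
exon_seq, exon_5mc = "ACGTCCGGATCGACGTCAGG", 4
# Hypothetical AT-rich intron window (1 cytosine, 1 methylated).
intron_seq, intron_5mc = "ATATATATACATATATATAT", 1

exon_bp = methylation_per_bp(exon_5mc, exon_seq)
intron_bp = methylation_per_bp(intron_5mc, intron_seq)
exon_c = methylation_per_cytosine(exon_5mc, exon_seq)
intron_c = methylation_per_cytosine(intron_5mc, intron_seq)

print(f"per bp:       exon={exon_bp:.2f} intron={intron_bp:.2f}")
print(f"per cytosine: exon={exon_c:.2f} intron={intron_c:.2f}")
```

      In this toy case the exon looks more methylated per base pair, while the intron looks more methylated per cytosine, which is why stating the normalization used matters for interpreting the distribution of the mark.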

      (4) In the legend to Figure 3D, the authors state that the centromeres are shown in blue, however in the figure they appear to be grey while var2csa is blue.

      We have revised the figure legend accordingly.

      Reviewer #2 (Recommendations For The Authors):

      I recommend using the term "transcription" rather than "expression" when discussing events at the gene level.

      We have revised the manuscript accordingly.

      I also recommend using the term "adhesion" to describe the physical interaction between infected erythrocytes and adhesion receptors rather than "adherence", which should be reserved to describe non-physical affinity (e.g., beliefs, faith).

      We have revised the manuscript accordingly.

      Important new evidence regarding transcriptional regulation of var genes in general and var2csa in particular should be discussed and cited.

      We have revised the manuscript accordingly.

    1. eLife Assessment

      This important research investigates the precision of numerosity perception in two types of tasks and concludes that human performance aligns with an efficient coding model optimized for current environmental statistics and task goals. The proposed model receives compelling evidence from two numerosity perception experiments and a reanalysis of an existing dataset of risky decision-making. These findings have theoretical implications for our understanding of numerosity perception and decision-making as well as the ongoing debate on different efficient coding models.

    2. Reviewer #3 (Public review):

      Summary:

      This work investigates whether human imprecision in numeric perception is a fixed structural constraint or an endogenous property that adapts to environmental statistics and task objectives. By measuring behavioral variability across different uniform prior distributions in both estimation and discrimination tasks, the authors show that perceptual imprecision increases sublinearly with prior width. They demonstrate that the specific exponents of this scaling (1/2 for estimation and 3/4 for discrimination) can be derived from an efficient-coding model, wherein decision-makers optimally balance task-specific expected rewards against the metabolic costs of neural coding. The revised manuscript expands this framework to accommodate logarithmic representations and validates the core model against an independent dataset of risky choices.
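
      As a minimal sketch of what "sublinear scaling with exponent 1/2 or 3/4" means in practice, the snippet below recovers a scaling exponent as the slope of a log-log regression of response variability on prior width. The widths and standard deviations are synthetic values generated from exact power laws, and the function name is hypothetical; this is not the authors' fitting procedure or data.

```python
import math


def fit_scaling_exponent(widths, sds):
    """Ordinary least-squares slope of log(sd) regressed on log(width)."""
    xs = [math.log(w) for w in widths]
    ys = [math.log(s) for s in sds]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Synthetic prior widths (assumed values, not those used in the experiments).
widths = [10.0, 20.0, 40.0, 80.0]

# Noise-free imprecision following sd proportional to width**alpha.
sd_estimation = [w ** 0.5 for w in widths]       # alpha = 1/2
sd_discrimination = [w ** 0.75 for w in widths]  # alpha = 3/4

alpha_est = fit_scaling_exponent(widths, sd_estimation)
alpha_disc = fit_scaling_exponent(widths, sd_discrimination)
print(round(alpha_est, 3), round(alpha_disc, 3))
```

      A slope of 1 would correspond to imprecision growing in proportion to the prior width; slopes below 1, as here, are what "sublinear" refers to.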

      Strengths:

      The authors have effectively addressed my previous concerns with rigorous additions:

      (1) The mathematical formulation has been revised into a discrete signal accumulation framework, making the objective function and resource trade-offs much more transparent and mathematically tractable.

      (2) The incorporation of the logarithmic representation resolves prior ambiguities regarding structural constraints.

      (3) The new split-half analysis effectively addresses the temporal dynamics of adaptation. The stability of the sublinear scaling across the experiment provides solid evidence that human subjects utilize rapid, top-down modulation to adjust their encoding strategy when explicitly informed about the environment.

      (4) Validating the derived scaling exponents on an independent risky-choice dataset robustly supports the generalizability of the theoretical framework beyond a single cognitive domain.

      Comments on revisions:

      The authors have addressed my remaining theoretical concern regarding the model's predictions for mean estimation bias. I have no further comments.

    3. Author response:

      The following is the authors’ response to the previous reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      The "number sense" refers to an imprecise and noisy representation of number. Many researchers propose that the number sense confers a fixed (exogenous) subjective representation of number that adheres to scalar variability, whereby the variance of the representation of number is linear in the number.

      This manuscript investigates whether the representation of number is fixed, as usually assumed in the literature, or whether it is endogenous. The two dimensions on which the authors investigate this endogeneity are the subject's prior beliefs about stimuli values and the task objective. Using two experimental tasks, the authors collect data that are shown to violate scalar variability and are instead consistent with a model of optimal encoding and decoding, where the encoding phase depends endogenously on prior and task objectives. I believe the paper asks a critically important question. The literature in cognitive science, psychology, and increasingly in economics, has provided growing empirical evidence of decision-making consistent with efficient coding. However, the precise model mechanics can differ substantially across studies. This point was made forcefully in a paper by Ma and Woodford (2020, Behavioral & Brain Sciences), who argue that different researchers make different assumptions about the objective function and resource constraints across efficient coding models, leading to a proliferation of different models with ad-hoc assumptions. Thus, the possibility that optimal coding depends endogenously on the prior and the objective of the task, opens the door to a more parsimonious framework in which assumptions of the model can be constrained by environmental features. Along these lines, one of the authors' conclusions is that the degree of variability in subjective responses increases sublinearly in the width of the prior. And importantly, the degree of this sublinearity differs across the two tasks, in a manner that is consistent with a unified efficient coding model.

      Comments on revisions:

      The authors have done an excellent job addressing my main concerns from the previous round. The new analyses that address the alternative model of "no cognitive noise and only motor noise" are compelling and provide quantitative evidence that bolsters the paper's overall contribution. The authors also went above and beyond by reanalyzing the Frydman and Jin (2022) dataset to provide new and very interesting analyses that provide an additional out of sample test of the model proposed in the current paper.

      Reviewer #2 (Public review):

      Summary:

      This paper provides an ingenious experimental test of an efficient coding objective based on optimizing task success. The key idea is that different tasks (estimation vs discrimination) will, under the proposed model, lead to a different scaling between the encoding precision and the width of the prior distribution. Empirical evidence in two tasks involving number perception supports this idea.

      Strengths:

      - The paper provides an elegant test of a prediction made by a certain class of efficient coding models previously investigated theoretically by the authors. The results in experiments and modeling suggest that competing efficient coding models, optimizing mutual information alone, may be incomplete by missing the role of the task.

      - The paper carefully considers how the novel predictions of the model interact with the Weber/Fechner law.

      Weaknesses:

      The claims would be even more strongly validated if data were present at more than two widths in the discrimination experiment (also noted in Discussion).

      Reviewer #3 (Public review):

      Summary:

      This work investigates whether human imprecision in numeric perception is a fixed structural constraint or an endogenous property that adapts to environmental statistics and task objectives. By measuring behavioral variability across different uniform prior distributions in both estimation and discrimination tasks, the authors show that perceptual imprecision increases sublinearly with prior width. They demonstrate that the specific exponents of this scaling (1/2 for estimation and 3/4 for discrimination) can be derived from an efficient-coding model, wherein decision-makers optimally balance task-specific expected rewards against the metabolic costs of neural coding. The revised manuscript expands this framework to accommodate logarithmic representations and validates the core model against an independent dataset of risky choices.

      Strengths:

      The authors have effectively addressed my previous concerns with rigorous additions:

      (1) The mathematical formulation has been revised into a discrete signal accumulation framework, making the objective function and resource trade-offs much more transparent and mathematically tractable.

      (2) The incorporation of the logarithmic representation resolves prior ambiguities regarding structural constraints.

      (3) The new split-half analysis effectively addresses the temporal dynamics of adaptation. The stability of the sublinear scaling across the experiment provides solid evidence that human subjects utilize rapid, top-down modulation to adjust their encoding strategy when explicitly informed about the environment.

      (4) Validating the derived scaling exponents on an independent risky-choice dataset robustly supports the generalizability of the theoretical framework beyond a single cognitive domain.

      Weaknesses:

      The methodological and theoretical issues raised in the first round have been thoroughly resolved, and the evidence supporting the claims regarding response variance is convincing.

      There is one remaining theoretical point that warrants discussion to provide a complete picture of the proposed generative model. The manuscript exquisitely models and predicts response variance (imprecision), but it remains largely silent on the closed-form predictions for the mean estimation (i.e., bias). Under the assumption of optimal Bayesian decoding combined with specific encoding schemes (e.g., linear vs. logarithmic), the model implicitly generates mathematical predictions for the subjects' mean estimates. Specifically, varying the scaling exponent (α) and the prior width (w) should systematically alter the predicted bias in different conditions.

      While fitting or explicitly explaining this mean bias is not strictly necessary for the core claims regarding variance scaling, acknowledging what the optimal decoder analytically predicts for the mean estimation, and how it aligns or contrasts with typical empirical observations, would strengthen the theoretical transparency of the paper.

      We thank the reviewers for their attention to our revised manuscript. We are very glad that the reviewers seem satisfied with how we have addressed their concerns. The paper is now stronger than in its first iteration.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      I have no further requests for the authors. I congratulate the authors on a great paper.

      Reviewer #2 (Recommendations for the authors):

      No further suggestions.

      Reviewer #3 (Recommendations for the authors):

      In the Figure 2b caption, the phrase "from which the numbers of dots are sampled" appears to be a typo carried over from the estimation task. It should likely read "from which the numbers are sampled", as the discrimination task uses Arabic numerals rather than dot arrays.

      Reviewer #3 points out that we have focused on the subjects’ response variability, and we did not report the mean estimates. We agree that the reader could reasonably expect to see this. We now include this in Figure 6.

      The subjects exhibit the typical patterns observed in numerosity-estimation tasks (most notably, the ‘central tendency of judgment’). The dotted line shows the predictions of the best-fitting model (with 𝛼 = 1/2) with the logarithmic encoding, which reproduces the subjects’ main behavioral patterns.

      We have slightly revised the manuscript. The revised version includes this Figure, in Methods (p. 28). We have modified the text of the Methods accordingly (bottom of p. 27), and we now refer to this analysis in the main text (line 6 of p. 5). We have also corrected the typo noted by Reviewer #3 (caption of Fig. 2b).

    1. eLife Assessment

      This valuable study is an approach to integrating and comparing single-cell genomics data across species. The evidence supporting the conclusions of this work is solid, and ANTIPODE presents an updated methodological approach to determining how gene expression at the cell-type level has evolved. Thus, ANTIPODE should provide broad utility to studies of comparative neurogenomics and be of use to neuroscientists and evolutionary biologists.

    2. Reviewer #1 (Public review):

      Summary:

      The integration of single-cell datasets across species is a powerful approach to understanding how cell types and patterns of gene expression have evolved. Current methods to perform such integrations require multiple steps: clustering, the integration itself, and downstream differential expression analysis. In this study, the authors describe a new approach, called ANTIPODE, that combines these steps by integrating deep learning with interpretable decoding and linear modeling. This method builds on previous deep learning approaches to dataset integration, namely SCVI and scANVI, that employ a variational autoencoder to model single-cell RNA-sequencing datasets. However, gene expression estimates from these previous methods are challenging to interpret due to non-linear decoding from the modeled latent space. ANTIPODE seeks to address this issue by using a single-layer decoder coupled to a linear model to estimate patterns of differential expression, e.g. differential expression by coexpression module, across cell types, etc.

      The authors apply their framework to a large single-cell RNA-seq dataset (~1.8M cells) containing cells from the central nervous systems of humans, macaques, and mice spanning in utero developmental time points. They identify a consensus set of cell clusters across each species. They find that ANTIPODE performs at least as well as SCVI in terms of species integration and batch correction. The authors demonstrate several use cases of this integrated approach by analyzing differential expression that correlates with gene structure, the evolution of expression differences in neuropeptide systems, and the anatomical and phylogenetic variation in neurodevelopmental timing.

      Strengths:

      ANTIPODE is a welcome addition to techniques that integrate large single-cell RNA-seq datasets across multiple species. The approach's simultaneous inference of cell clusters, integration manifolds, and differential expression should streamline analysis pipelines whose elements are often disjointed and sometimes work at cross purposes.

      Weaknesses:

      The authors note several limitations to their method that will be targets for future development. First, clustering "resolution" is inferred from the data and cannot be tuned as with other approaches. Second, because of the linear decoding, ANTIPODE does not accommodate combining datasets obtained from different modalities (e.g. single-cell with single-nucleus RNA-seq). Third, as currently implemented, ANTIPODE does not explicitly model phylogenetic relationships. However, the authors describe an extension that could enable this, enhancing the power of multiple species integrations. A weakness with the current manuscript is the organization and readability of the figures. The supplemental figures in particular need to be restructured and reformatted to increase their interpretability.

    3. Reviewer #2 (Public review):

      Summary:

      This work presents ANTIPODE, a bilinear generative model developed for the simultaneous integration and identification of cell types across species and developmental stages using single-cell RNA-seq data. ANTIPODE is inspired by scANVI, a well-established semi-supervised framework for single-cell transcriptomics. After describing its implementation, the authors use ANTIPODE to integrate data from 15 species comprising 1,854,767 cells. Then, the authors benchmark ANTIPODE against commonly used methods (scVI, Harmony, and Scanorama) using two snRNAseq datasets and report comparable or superior performance. They then return to the initial integrated dataset and analyse patterns of gene expression evolution. Finally, they leverage the model to study the "later-is-larger" concept, evaluating the relationship between gene expression, developmental timing and structure size and finding gene expression signatures of this concept.

      Strengths:

      A major strength of the paper is that ANTIPODE employs a bilinear decoding architecture, which produces more interpretable model parameters while performing at least as well as existing, more opaque nonlinear integration approaches.

      The authors demonstrate the utility of ANTIPODE by integrating single-cell mRNA sequencing data from mouse, macaque, and human brains and confirming general principles regarding developmental timing and cell-type-specific gene expression divergence.

      They also propose a conceptually interesting framework for studying gene expression evolution: instead of focusing solely on differentially expressed genes between homologous cell types, they jointly model gene expression across developmental states and species-specific divergence, allowing them to define and analyse four categories of differential expression.

      Finally, the authors' conclusions are well supported by the analyses presented, although these conclusions remain relatively conservative and reinforce already established principles.

      Weaknesses:

      A central weakness of the paper is its limited accessibility to a broad audience. Despite attempting to keep computational details in the supplement, the main text still uses substantial jargon, undermining the goal of providing an intuitive explanation of the model. The figures are also insufficiently annotated (e.g., colour schemes in Figure 2 heatmap, bubble plot details in Figure 3, entropy definition in Figure 3), and the figure legends are overly brief and lack essential information. I strongly recommend that the authors revise both text and figures to improve clarity and readability.

      Similarly, the materials and methods lack a lot of information about the implementation of the model, the statistical tests used, the calculations of entropy, etc.

      The study sits between tool development and biological discovery but does not fully commit to either. As a result, it cannot be evaluated as a full benchmarking study, yet it also does not provide new biological insights that are validated experimentally.

      Finally, the GitHub repository for ANTIPODE is not yet functional and lacks documentation or tutorials, making it impossible to assess usability or reproducibility.

    4. Author response:

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      The integration of single-cell datasets across species is a powerful approach to understanding how cell types and patterns of gene expression have evolved. Current methods to perform such integrations require multiple steps: clustering, the integration itself, and downstream differential expression analysis. In this study, the authors describe a new approach, called ANTIPODE, that combines these steps by integrating deep learning with interpretable decoding and linear modeling. This method builds on previous deep learning approaches to dataset integration, namely SCVI and scANVI, that employ a variational autoencoder to model single-cell RNA-sequencing datasets. However, gene expression estimates from these previous methods are challenging to interpret due to non-linear decoding from the modeled latent space. ANTIPODE seeks to address this issue by using a single-layer decoder coupled to a linear model to estimate patterns of differential expression, e.g. differential expression by coexpression module, across cell types, etc.

      The authors apply their framework to a large single-cell RNA-seq dataset (~1.8M cells) containing cells from the central nervous systems of humans, macaques, and mice spanning in utero developmental time points. They identify a consensus set of cell clusters across each species. They find that ANTIPODE performs at least as well as SCVI in terms of species integration and batch correction. The authors demonstrate several use cases of this integrated approach by analyzing differential expression that correlates with gene structure, the evolution of expression differences in neuropeptide systems, and the anatomical and phylogenetic variation in neurodevelopmental timing.

      Strengths:

      ANTIPODE is a welcome addition to techniques that integrate large single-cell RNA-seq datasets across multiple species. The approach's simultaneous inference of cell clusters, integration manifolds, and differential expression should streamline analysis pipelines whose elements are often disjointed and sometimes work at cross purposes.

      Weaknesses:

      The authors note several limitations to their method that will be targets for future development. First, clustering "resolution" is inferred from the data and cannot be tuned as with other approaches. Second, because of the linear decoding, ANTIPODE does not accommodate combining datasets obtained from different modalities (e.g. single-cell with single-nucleus RNA-seq). Third, as currently implemented, ANTIPODE does not explicitly model phylogenetic relationships. However, the authors describe an extension that could enable this, enhancing the power of multiple species integrations. A weakness with the current manuscript is the organization and readability of the figures. The supplemental figures in particular need to be restructured and reformatted to increase their interpretability.

      We thank this reviewer for their positive feedback regarding the utility of the model and how it may simplify challenging evolutionary analysis.

      We acknowledge that the figures are a bit difficult to read, and we will improve annotation and tidiness to make them more accessible to the reader.

      We have implemented changes for ANTIPODE version 0.2, which includes regression of gene expression differences on a phylogeny. We have updated the GitHub repository with this “antipode.phylo” module. For this study, the three-species case is equivalent for flat or phylogenetic regression (for example, mouse up is equivalent to primate down), so we do not plan to redo the analyses in the text using this new version.

      We have already provided examples for running ANTIPODE on our own and public datasets (https://github.com/mtvector/scANTIPODE/tree/main/real_examples), as well as in-line documentation of classes and functions; however, these may not give new users sufficient information. We will provide true explanatory tutorials for both to address the reviewer’s concerns. ANTIPODE version 0.1 is currently installable from either GitHub or PyPI.

      Reviewer #2 (Public review):

      Summary:

      This work presents ANTIPODE, a bilinear generative model developed for the simultaneous integration and identification of cell types across species and developmental stages using single-cell RNA-seq data. ANTIPODE is inspired by scANVI, a well-established semi-supervised framework for single-cell transcriptomics. After describing its implementation, the authors use ANTIPODE to integrate data from 15 species comprising 1,854,767 cells. Then, the authors benchmark ANTIPODE against commonly used methods (scVI, Harmony, and Scanorama) using two snRNAseq datasets and report comparable or superior performance. They then return to the initial integrated dataset and analyse patterns of gene expression evolution. Finally, they leverage the model to study the "later-is-larger" concept, evaluating the relationship between gene expression, developmental timing and structure size and finding gene expression signatures of this concept.

      Strengths:

      A major strength of the paper is that ANTIPODE employs a bilinear decoding architecture, which produces more interpretable model parameters while performing at least as well as existing, more opaque nonlinear integration approaches.

      The authors demonstrate the utility of ANTIPODE by integrating single-cell mRNA sequencing data from mouse, macaque, and human brains and confirming general principles regarding developmental timing and cell-type-specific gene expression divergence.

      They also propose a conceptually interesting framework for studying gene expression evolution: instead of focusing solely on differentially expressed genes between homologous cell types, they jointly model gene expression across developmental states and species-specific divergence, allowing them to define and analyse four categories of differential expression.

      Finally, the authors' conclusions are well supported by the analyses presented, although these conclusions remain relatively conservative and reinforce already established principles.

      Weaknesses:

      A central weakness of the paper is its limited accessibility to a broad audience. Despite attempting to keep computational details in the supplement, the main text still uses substantial jargon, undermining the goal of providing an intuitive explanation of the model. The figures are also insufficiently annotated (e.g., colour schemes in Figure 2 heatmap, bubble plot details in Figure 3, entropy definition in Figure 3), and the figure legends are overly brief and lack essential information. I strongly recommend that the authors revise both text and figures to improve clarity and readability.

      Similarly, the materials and methods lack a lot of information about the implementation of the model, the statistical tests used, the calculations of entropy, etc.

      The study sits between tool development and biological discovery but does not fully commit to either. As a result, it cannot be evaluated as a full benchmarking study, yet it also does not provide new biological insights that are validated experimentally.

      Finally, the GitHub repository for ANTIPODE is not yet functional and lacks documentation or tutorials, making it impossible to assess usability or reproducibility.

    1. eLife Assessment

      This manuscript identifies temperature-dependent alternative splicing of PIF4 in Arabidopsis thaliana and shows that heat stress promotes the accumulation of a short exon 5-skipping isoform that is predicted to encode a non-functional protein. This finding is important, and it provides an intriguing new layer of regulation for PIF4; however, the strength of the mechanistic conclusions is limited, and several key conclusions rely on indirect evidence. As a result, while the data robustly demonstrate heat-regulated alternative splicing of PIF4, the causal role of PIF4 isoforms' balance in shaping heat-induced developmental responses remains only partially supported and the strength of the evidence presented is incomplete. This work will be of interest to biologists working on alternative splicing.

    2. Reviewer #1 (Public review):

      This manuscript by Niño-González and collaborators shows that PIF4 undergoes alternative splicing in response to elevated temperature, generating distinct isoforms that may contribute to early seedling responses of Arabidopsis thaliana to heat stress (37 °C). This work provides an intriguing perspective on how PIF activity may be modulated under stress conditions.

      The authors report rapid heat-induced changes in seedling morphology, with cotyledon angle and hypocotyl length altered as early as 3 hours after transfer to 37 °C. These responses correlate with a transient increase in PIF4 transcript levels, followed by a return to control values at later time points. Notably, heat induces preferential production of an exon 5-skipping isoform of PIF4. The resulting short protein variant (PIF4-S) lacks part of the bHLH domain and is therefore unlikely to be transcriptionally active.

      To explore functional consequences, the authors expressed the exon 5 inclusion (functional) isoform, PIF4-L, in the pif4-101 mutant background. Some heat-induced phenotypes, such as protochlorophyllide accumulation and subsequent photobleaching, were reduced or absent in these lines. Interestingly, pif4-101 mutants themselves largely resemble WT plants for most heat-responsive traits, with the exception of hypocotyl length. PIF4-L expression specifically attenuates the cotyledon angle response to heat, without strongly affecting hypocotyl elongation.

      An important point is that PIF4 itself is not essential for the observed heat responses, as pif4 mutants respond largely like wild-type plants. This implies that the phenotypes described are likely controlled by multiple PIFs acting redundantly. In this context, the generation of the PIF4-S isoform may represent one of several mechanisms by which heat stress reduces overall functional PIF levels, rather than a PIF4-specific regulatory switch.

      Other caveats should be considered when interpreting the work. The functional relevance of the PIF4-S isoform under heat stress is not tested, as heat responses of these transgenic lines were not examined. Transcriptome analysis of heat-stressed WT, pif4-101 mutant, and PIF4-L-expressing plants revealed an enrichment of PIF-regulated genes, supporting a possible role for this family of transcription factors in the heat stress response. Notably, the heat responsiveness of the mutant and of the transgenic lines differs only marginally from that of WT plants. In addition, the study relies primarily on total transcript-level analyses, without quantitative assessment of individual PIF isoforms or direct measurement of PIF protein abundance. Given that other PIFs are also expressed and may be subject to alternative RNA processing, it needs to be determined whether PIF4-S alone could exert a dominant effect, counteracting all the other functional PIFs by itself, under heat stress. Hence, the proposed model is a plausible but still incomplete framework that requires further experimental validation and analysis.

      Altogether, the results presented in this manuscript could also be interpreted as follows: multiple PIFs contribute to the observed phenotypes in response to heat, with overlapping (redundant) functions. Heat stress may reduce functional PIF levels through different mechanisms, one of which is the regulation of alternative splicing, as shown here for PIF4, leading to the production of non-functional proteins or protein variants that could act as negative competitors (such as PIF4-S). Restoring PIF levels to values of control conditions could therefore reverse heat-induced phenotypes, as observed in the PIF4-L expression lines.

      Main concerns:

      (1) The existence of a shorter isoform of PIF4 and PIF6 is relevant, and PIF4 could indeed play a role in the context of heat stress, as it does in thermomorphogenesis. In this sense, the interplay between PIF4-S and PIF4-L might be linked to plant morphological responses to heat; however, the present work requires further investigation to determine whether this is indeed the case. It is important to note that pif4 mutants behave similarly to WT plants, indicating that PIF4 is not necessary for the observed responses. These phenotypes are therefore most likely related to several PIFs rather than to one specific family member. The results obtained with the transgenic lines expressing PIF4-L or PIF4-S support this interpretation, as increasing a functional PIF (PIF4-L) reduces some phenotypes, while expressing a dominant-negative version mimics heat-induced phenotypes under control conditions. Thus, it is reasonable to interpret that under heat stress, functional PIF levels are reduced through multiple mechanisms, alternative splicing and PIF4-S generation being one of them in the case of PIF4, but likely with additional effects on other family members. This clearly requires further study.

      (2) RT-qPCR quantification of total PIF4 transcripts, as well as the long and short isoforms under the tested conditions, is necessary. While we agree with the authors that PIF4-S could act as a dominant-negative factor, demonstrating this requires comparison of phenotypes under heat versus control conditions using the PIF4-S transgenic lines. Importantly, for the authors' hypothesis to be valid, PIF4-S must be able to outcompete other PIFs; therefore, accurate quantification of its expression levels across conditions is crucial. Combining the results shown in Figures 2A and 2G suggests that the levels of the functional PIF4-L isoform are unchanged or even reduced after 3 h of heat treatment, as the increase in total PIF4 does not fully compensate for the diversion toward PIF4-S. Additionally, it would be equally relevant to quantify the expression of other PIFs (or at least those shown in Suppl. Fig. 6) to determine whether PIF4-S could exert such a strong effect even when expressed at relatively low levels. By "accurate quantification", we refer specifically to functional protein-coding variants, as in the PIF4-L case. Supplemental Figure 6 shows that PIF3 and PIF5 appear unaffected by heat, while PIF1 expression is increased. However, JBrowse data for dark-grown seedlings indicate that PIF1 is subject to alternative transcription initiation, alternative splicing, and alternative polyadenylation at its 3′ end. A similar situation occurs for PIF3, at least at the 5′ end of the transcriptional unit. Therefore, alternative RNA processing mechanisms may play a key role in modulating functional PIF protein levels in response to heat. Without considering diverted isoforms of other PIFs, the interpretation becomes problematic, as PIF1 is upregulated by heat, and PIF4-S would therefore need to overcome its activity as well. This is particularly relevant given that the cotyledon angle phenotype at 37 °C appears even stronger than in the pif1pif3pif5 triple mutant, if such a comparison is feasible.

      (3) In addition, PP2A is a well-established housekeeping gene for normalization across different light regimes, as its expression is not affected by light. However, we are not convinced this holds true under heat stress conditions (see Li et al., Plant Cell 2019 Jul 29;31(10):2353-2369. doi:10.1105/tpc.19.00519).

      (4) Furthermore, the mechanistic conclusions would be strengthened by directly assessing PIF protein levels, for example, by western blot analysis, to determine whether changes in transcript isoform abundance translate into corresponding changes in protein accumulation under heat stress.

      (5) Importantly, the authors' interpretation that "PIF4-L.1 expresses the long isoform at levels similar to those of WT plants (Supplemental Figure 9A), ruling out the possibility that the suppression of heat-induced phenotypes (cotyledon opening and Pchlide accumulation) is due to elevated PIF4 expression levels" is not correct. The RT-qPCR assay quantifies all isoforms containing exon 6, which include both long and short variants with respect to exon 5 inclusion. Since WT plants at 37 °C express both isoforms (L/S ≈ 60/40), the PIF4-L lines actually express 2-4-fold higher levels of the functional PIF4 isoform, based on the values shown in the figures.

      (6) Figure 3B should include a statistical analysis, as it appears that PIF4-L expression does not significantly reduce photobleaching. Cotyledon angle is not affected by either the pif4 mutation or PIF4-L expression under 22 °C conditions (Figure 3C). However, after 24 h at 37 °C, there is a clear effect, with cotyledon angles closer to those observed in WT plants at 22 °C. Regarding hypocotyl length, although statistical testing was not performed, it is evident that pif4-101 affects this parameter, while PIF4-L expression in this background does not substantially alter the mutant response.
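
      The arithmetic behind main concern (5) can be sketched as follows. This is an illustrative calculation only, not the authors' analysis: the function name and the total-transcript values passed to it are hypothetical, and the only number taken from the review is the approximate L/S ratio of 60/40 in heat-treated WT plants. The sketch assumes that the transgenic line produces only the long isoform and that the RT-qPCR assay reports total (long plus short) transcript relative to WT.

```python
def functional_isoform_fold(total_vs_wt: float,
                            wt_long_fraction: float,
                            line_long_fraction: float = 1.0) -> float:
    """Fold difference in long (functional) isoform, transgenic line vs WT.

    total_vs_wt        -- total PIF4 transcript in the line relative to WT (WT = 1)
    wt_long_fraction   -- fraction of WT transcript that is the long isoform
    line_long_fraction -- fraction of the line's transcript that is the long
                          isoform (assumed 1.0 for a PIF4-L cDNA line)
    """
    if not 0.0 < wt_long_fraction <= 1.0:
        raise ValueError("wt_long_fraction must be in (0, 1]")
    if not 0.0 <= line_long_fraction <= 1.0:
        raise ValueError("line_long_fraction must be in [0, 1]")
    wt_long = 1.0 * wt_long_fraction              # long isoform in WT
    line_long = total_vs_wt * line_long_fraction  # long isoform in the line
    return line_long / wt_long


if __name__ == "__main__":
    # Hypothetical total-transcript levels; 0.6 is the L fraction quoted in the review.
    for total in (1.0, 1.2, 2.4):
        fold = functional_isoform_fold(total, wt_long_fraction=0.6)
        print(f"total = {total:.1f} x WT -> long isoform = {fold:.2f} x WT")
```

      Under these assumptions, a line whose total transcript merely matches WT still carries more of the long isoform than WT, because 40% of the WT signal comes from the short isoform. Equal total levels therefore do not imply equal levels of functional PIF4.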

      Other comments:

      (1) We do not believe that Figure 3E is an optimal way to demonstrate attenuation of transcriptional changes by PIF4-L expression in pif4 mutants. A heat map representation would likely be more direct and informative.

      The authors should consider expressing another functional PIF in the pif4 mutant background to determine whether the observed effects are specific to PIF4, as proposed, or whether they reflect a general PIF function.

      (2) It would also be informative to examine the response under Light + 37 °C conditions. Since PIF4 mRNA accumulation is induced by light, the authors should test whether plants incubated in light show a similar response to heat or whether it is attenuated. Potential cross-regulation between light and heat responses would be worth exploring.

      (3) As the authors acknowledge in the introduction, most of our knowledge regarding PIFs in temperature signalling has focused on thermomorphogenesis. Therefore, we believe it is important to place these new findings (exon 5 skipping) within that framework, as they could help explain observations made under better-characterized conditions. In addition, it would be interesting to see the phenotypes of the pifq mutant under heat stress. Even though this mutant line displays a heat-stress-like phenotype under control conditions, it may still respond to heat treatment. If so, this would indicate that PIFs are not fully determinative of this response.

      (4) The authors should clearly state the genetic background of the PIF4-S expression lines, which appear to be in the pif4-101 background but are not explicitly described as such in the manuscript.

    3. Reviewer #2 (Public review):

      The manuscript "Alternative splicing of PIF4 regulates plant development under heat stress" by Niño-González et al. describes a heat-responsive alternative splicing (AS) event in PIF4 in Arabidopsis and its potential impact on seedling development. The authors observe that etiolated seedlings exposed to heat respond with a more photomorphogenic developmental behaviour, as reflected, for example, by increased cotyledon opening and reduced hypocotyl elongation. They propose that the AS event in PIF4 may contribute to this response, due to reduced formation of the full-length PIF4 protein and an increase in the shorter PIF4 protein with potentially dominant negative functions.

      Expressing the individual variants in a pif4 mutant background was used to further examine their function. In the case of the full-length PIF4 variant, some of the heat-induced phenotypes were suppressed. For the lines overexpressing the shorter PIF4 variant, heat responses were not examined.

      The authors describe an interesting phenotype and present an appealing model of how AS of PIF4, a well-known key regulator of developmental processes including light- and temperature responses, might be involved. However, I don't think that the authors provide strong evidence for their model, and the unaltered heat response of pif4 mutants argues against a major role of this gene and its AS event under these conditions. Regarding the heat responses, it remains open how distinct those are from thermomorphogenesis.

      Weaknesses:

      (1) In the manuscript, it is emphasized that previous studies on PIFs' role in temperature responses have mainly focused on thermomorphogenesis under high ambient temperature and not under hot temperatures causing heat stress. How do the authors know that the effects they are looking at are specific to hot temperatures and do not also occur at more moderate temperature increases? So, what would PIF4 splicing look like upon a shift from 22 °C to 28 °C (instead of 37 °C as used in the manuscript)?

      (2) The potential role of PIF4 and its AS event in the heat response is the key point of this manuscript, as also reflected by the title. As summarized above, I don't see direct evidence for this and a functional characterization of the AS event is lacking. First, the pif4 mutant doesn't show an altered response, which argues against its requirement under these conditions, and in particular against the proposed model that a shortened version of PIF4 acts in a dominant negative manner. Second, the impact of AS on PIF4 protein levels remains open. Antibodies against PIF4 exist and have been used before, e.g. in Lee et al. (2021), Nat Comm, and Fan et al. (2025), Nat Comm - both studies address the role of PIF4 in thermomorphogenesis and should also be discussed in this manuscript. Detecting PIF4 proteins would allow testing if indeed both PIF4 protein variants are detectable and whether, upon heat stress, the longer variant decreases while the shorter variant increases. This could be expected based on transcript data; however, due to regulation at multiple steps, a correlation between transcript and protein levels might not exist. Third, the transgenic lines expressing either the short or long PIF4 variant do not really reflect the situation in the wild type and might be/are overexpression lines. Specifically, constructs for both variants lack the UTRs according to the description in the method section. Furthermore, is the short version expressed as GFP fusion, as I understood from the method description? The PIF4-L mutants have similar PIF levels as the WT (SFig. 9); however, this refers to total transcripts, which makes a difference in the wild type, in particular under heat stress. Comparing here only the PIF4-L levels would be more informative. Accordingly, the transgenic lines may overexpress PIF4-L compared to the wild type. All the PIF4-S lines show 4 to 5-fold overexpression (again for total transcripts) compared to WT. 
Including lines with lower overexpression levels would be needed for a direct comparison to the wild type. Moreover, immunoblot analysis of the PIF4 protein would be needed for a direct comparison between the wild type and the two types of mutants.

      (3) Apart from the question of what level of (over)expression the transgenic lines have, several aspects of the phenotyping experiments are not in line with a simple model of PIF4 regulation or have not been addressed. Expressing the long PIF4 variant in the pif4 mutant background suppresses some of the heat-induced changes, but not the hypocotyl shortening, suggesting that the hypocotyl effect is not caused by a heat-induced lack of PIF4.

      When expressing the short variant, the authors observe increased cotyledon opening in darkness, consistent with a suppression of skotomorphogenesis due to a negative function of PIF4-S, at least when it is overexpressed. For hypocotyl length, no consistent difference between wild type and PIF4-S lines was observed: seedlings grown for 3 d in darkness had identical lengths, and for 4-d-old seedlings the PIF4-S lines did not give consistent results. PIF4-S.1 (which has the highest transgene expression) had the same length as wild type; a pronounced difference was only seen for PIF4-S.3, which is the line with the lowest expression. Have the experiments been reproduced with independent seed batches? I'm also wondering why the authors haven't performed the heat stress experiments with these PIF4-S lines, as they did for the PIF4-L mutants. According to the authors' model, the PIF4-S lines might show an opposite response compared to the PIF4-L lines, i.e. an even more pronounced heat effect compared to the wild type.

      (4) Why was the heat effect on AS of PIF6 not further analysed? Previous work showed the role of PIF6 in seed development and germination; in line with this, PIF6 expression is particularly high in embryos and seeds, but it is also expressed and alternatively spliced in other tissues and conditions, as shown in Figure 1 and SFigure 2. From the data in Figure 1, it looks like the AS pattern in heat might also be different from other conditions. So, it would be interesting to see how AS of PIF6 changes in the control and heat samples that the authors analysed for PIF4 AS, in particular, if this response is distinct for PIF4 versus PIF6.

      (5) The presentation of the RNA-seq data is incomplete. According to the method section, WT, pif4-101, PIF4-L.1 and PIF4-L.2 seedlings upon 3 h heat/control treatment were analysed. Why are DE and DAS genes and comparisons of different genotypes not shown? The FC data displayed in Figure 2E and the overlap between heat-regulated genes (Fig. 3D; only in WT) and PIF regulation show only some aspects of the data.

    4. Reviewer #3 (Public review):

      Summary:

      PIFs play a pivotal role not only in light and temperature signaling pathways, but in many other signaling pathways regulating plant development by modulating transcription of a large number of genes both directly and indirectly. Similarly, alternative splicing (AS) plays a critical role in shaping the splice isoforms of thousands of genes under different environmental conditions to regulate plant development. In fact, AS of PIF6 has been shown to be involved in seed development. PIF4 is a central transcription factor integrating light and temperature signaling pathways. However, AS of PIF4 has not previously been implicated in any pathway. This story is the first to describe how AS of PIF4 is regulated by heat stress and how this regulation is involved in heat stress signaling to regulate plant development. This is an important finding of general interest.

      Strengths:

      The authors are the first to describe that AS of PIF4 is regulated by heat stress and that this regulation is involved in heat stress signaling to regulate plant development.

      Weaknesses:

      There are many loose ends in this story that need to be tied up.

      Major points:

      (1) The authors are showing only the AS transcripts by PCR, but no protein data. Given that the hypothesis is that the short form of PIF4 is functioning in a dominant negative fashion, the authors need to show that this short isoform expresses a protein. In addition, they need to show that this form is functioning in a dominant negative fashion with other PIFs, either by showing that this form reduces the DNA binding and/or transcriptional responses of other PIFs.

      (2) The two mutant alleles used for this study (pif4-100 and pif4-2) have T-DNA insertions after the AS exon. Do these alleles express any short version of the protein? Previous studies showed no protein production, and thus these alleles may not function as a dominant negative form. T-DNA insertion alleles may express truncated transcripts, but many do not express any protein due to the lack of a stop codon and/or degradation of the transcripts. But in this case, the mutants are behaving like WT. The authors need to show whether these alleles express a truncated version of the PIF4 protein.

      (3) Figure 4 shows phenotypes of independent lines expressing the PIF4 short version. The authors analyzed only the cotyledon and hypocotyl phenotypes, but not Pchlide or bleaching assays. The authors need to do a thorough phenotype analysis, including heat-stress phenotypes of these lines, to test if the data make sense with their hypothesis.

    5. Author response:

      We would like to thank the Editor and the three Reviewers for their detailed assessment of our manuscript and their constructive feedback. We found the suggestions valuable for refining our work. Before presenting the fully updated manuscript, we would like to clarify a few points in this initial response. This manuscript identifies a heat-induced, alternatively spliced short isoform of PIF4 (PIF4-S) that contributes to the physiological responses observed in heat-stressed etiolated seedlings. First, we agree with all Reviewers that including PIF4 protein data will strengthen our findings and more definitively demonstrate the generation of a protein-coding alternative isoform under heat stress. Therefore, this will be one of our main priorities in the revision. Evidence for the functionality of this alternative isoform is clearly demonstrated by the distinct phenotypes exhibited by transgenic lines expressing either the long or the short versions of PIF4. Nevertheless, we agree that a more comprehensive characterization of these lines, as well as of the pif4 mutant lines, will further strengthen the demonstration of the functional relevance of this alternative splicing event. In addition, we will extend the phenotypic analysis of the PIF4-S lines to heat stress conditions. Importantly, the phenotypes observed in these lines suggest that additional molecular mechanisms may act in parallel with this alternative splicing event to regulate development in heat-stressed etiolated seedlings. As proposed by Reviewer #1, other PIFs may be involved in this response, and we will address this possibility. We will also provide new experimental data to show that alternative splicing in this gene is specific to heat stress and does not occur in other PIFs. Finally, we would like to clarify that the main scope of this manuscript is to demonstrate the functional relevance of the alternative isoform generated by splicing in PIF4 under heat stress. 
A detailed investigation of its molecular mode of action is beyond the scope of the present study. We sincerely appreciate the thoughtful feedback provided by all Reviewers. We will carefully consider their suggestions and use them to guide the inclusion of additional experiments and analyses in our revised manuscript to reinforce and clarify our conclusions.

    1. eLife Assessment

      The revised manuscript by Shukla et al. provides important mechanistic insights into kinesin-1 autoinhibition and cargo-mediated activation. Through a convincing integration of protein engineering, computational modeling, biophysical assays, HDX-MS, and electron microscopy, the study delineates how cargo binding induces an allosteric transition that propagates along the coiled-coil stalk to the motor domains, enhancing MAP7 engagement. The revisions substantially improve clarity, figure annotation, and methodological transparency, leaving the remaining limitations, primarily those inherent to conformational heterogeneity and resolution, appropriately acknowledged. Overall, the updated manuscript presents a coherent mechanism for kinesin-1 activation that will be of broad interest to the motor protein, structural biology, and cell biology communities.

    2. Reviewer #1 (Public review):

      The authors aim to interrogate the sets of intramolecular interactions that cause kinesin-1 hetero-tetramer autoinhibition and the mechanism by which cargo interactions via the light chain tetratricopeptide repeat domains can initiate motor activation. The molecular mechanisms of kinesin regulation remain a key question with respect to intracellular transport and this study adds important perspectives to our understanding. It has implications for the accuracy and efficiency of motor transport by different motor families, for example the direction of cargos in one or other direction on microtubules.

      The authors focus on the response of inactivated kinesin-1 to peptides found in cargos and the cascade of conformational changes that are induced. They also test the effects of the known activator of kinesin-1 - MAP7 - in the context of their model. The study benefits from multiple complementary, albeit relatively low-resolution, methods - structural prediction using AlphaFold3, 2D and 3D analysis of (mainly negative stain) TEM images of several engineered kinesin constructs, biophysical characterisation of the complexes, peptide design, hydrogen/deuterium-exchange mass spectrometry and simple cell-based imaging. Each set of experiments is carefully designed and the intrinsic limitations of each method are offset by other approaches, such that the assembled data convincingly supports the authors' regulatory model of kinesin activation.

      This study benefits from prior work by the authors on this system and the tools and constructs they previously accrued, as well as from other recent contributions to the field. This work will be of broad interest to cell and structural biologists, especially those seeking to tackle small and flexible macromolecular complexes, as well as biophysicists and those interested in protein engineering.

    3. Reviewer #2 (Public review):

      Summary:

      In this paper, Shukla, Cross, Kish, and colleagues investigate how binding of a cargo-adaptor mimic (KinTag) to the TPR domains of the kinesin-1 light chain, or disruption of the TPR docking site (TDS) on the kinesin-1 heavy chain, triggers release of the TPR domains from the holoenzyme. This dislocation provides a plausible mechanism for transition out of the autoinhibited lambda-particle toward the open and active conformation of kinesin-1. Using a combination of negative-stain electron microscopy, AlphaFold modeling, biochemical assays, hydrogen-deuterium exchange mass spectrometry (HDX-MS) and other methods, the authors show how TPR undocking propagates conformational changes through the coiled-coil stalk to the motor domains, increasing their mobility, and enhances interactions with the microtubule-bound cofactor MAP7. Together, they propose a model in which the TDS on CC1 of the heavy chain forms a "shoulder" in the compact, autoinhibited state. Cargo-adaptor binding, mimicked here by KinTag, dislodges this shoulder, liberating the motor domains and promoting MAP7 association, driving kinesin-1 activation.

      Strengths:

      Throughout the study, the authors use clever construct design - e.g. delta-Elbow, ElbowLock, CC-Di and the high-affinity KinTag - to test specific mechanisms by directly perturbing structural contacts or effecting interactions. The proposed mechanism of releasing autoinhibition via adaptor-induced TPR undocking is also interrogated with a number of complementary techniques that converge on a convincing model for activation that can be further tested in future studies.

      Weaknesses:

      These reflect limits of what the current data can establish rather than flaws in execution. It remains to be tested if the open state of kinesin-1 initiated by TPR undocking is indeed an active state of kinesin-1 capable of processive movement and/or cargo transport. It also remains to be determined what the mechanism of motor domain undocking from the autoinhibited conformation is. But this important study provides the groundwork for testing these open questions.

      Comments on revisions:

      My original minor concerns have been addressed in the revision.

    4. Reviewer #3 (Public review):

      Summary:

      The manuscript by Shukla and colleagues presents a comprehensive study that addresses a central question in kinesin-1 regulation: how cargo binding to the kinesin light chain (KLC) tetratricopeptide repeat (TPR) domains triggers activation of full-length kinesin-1 (KHC). The authors combine AlphaFold3 modeling, biophysical analysis (fluorescence polarization, hydrogen-deuterium exchange), and electron microscopy to derive a mechanistic model in which the KLC-TPR domains dock onto coiled-coil 1 (CC1) of the KHC to form the "TPR shoulder," stabilizing the autoinhibited (λ-particle) conformation. Binding of a W/Y-acidic cargo motif (KinTag) or deletion of the CC1 docking site (TDS) dislocates this shoulder, liberating the motor domains and enhancing accessibility to cofactors such as MAP7. The results link cargo recognition to allosteric structural transitions and present a unified model of kinesin-1 activation. I recommend acceptance of the manuscript subject to the following additions:

      Strengths:

      (1) The study addresses a fundamental and long-standing question in kinesin-1 regulation using a multidisciplinary approach that combines structural modeling, quantitative biophysics, and electron microscopy.

      (2) The mechanistic model linking cargo-induced dislocation of the TPR shoulder to activation of the motor complex is well supported by both structural and biochemical evidence.

      (3) The authors employ elegant protein-engineering strategies (e.g., ElbowLock and ΔTDS constructs) that enable direct testing of model predictions, providing clear mechanistic insight rather than purely correlative data.

      (4) The data are internally consistent and align well with previous studies on kinesin-1 regulation and MAP7-mediated activation, strengthening the overall conclusion.

      Weaknesses:

      (1) While the EM and HDX-MS analyses are informative, the conformational heterogeneity of the complex limits structural resolution, making some aspects of the model (e.g., stoichiometry or symmetry of TPR docking) indirect rather than directly visualized.

      (2) The dynamics of KLC-TPR docking and undocking remain incompletely defined; it is unclear whether both TPR domains engage CC1 simultaneously or in an alternating fashion.

      (3) The interplay between cargo adaptors and MAP7 is discussed but not experimentally explored, leaving open questions about the sequence and exclusivity of their interactions with CC1.

      Comments on revisions:

      The authors have addressed my comments satisfactorily.

    5. Author response:

      The following is the authors’ response to the original reviews

      eLife Assessment

      The manuscript by Shukla et al. provides important mechanistic insights into kinesin-1 autoinhibition and cargo-mediated activation. Using a convincing combination of protein engineering, computational modeling, biophysical assays, HDX-MS, and electron microscopy, the authors reveal how cargo binding induces an allosteric transition that propagates to the motor domains and enhances MAP7 binding. Despite limitations arising from conformational heterogeneity and structural resolution, the study presents a unified mechanism for kinesin-1 activation that will be of broad interest to the motor protein, structural biology, and cell biology communities.

      We are grateful for the time and effort from the reviewers and editors in providing fair and constructive comments that have helped to improve the manuscript. Our point-by-point response is provided below.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      The authors aim to interrogate the sets of intramolecular interactions that cause kinesin-1 hetero-tetramer autoinhibition and the mechanism by which cargo interactions via the light chain tetratricopeptide repeat domains can initiate motor activation. The molecular mechanisms of kinesin regulation remain an important question with respect to intracellular transport. It has implications for the accuracy and efficiency of motor transport by different motor families, for example, the direction of cargos towards one or other microtubules.

      Strengths:

      The authors focus on the response of inactivated kinesin-1 to peptides found in cargos and the cascade of conformational changes that occur. They also test the effects of the known activator of kinesin-1 - MAP7 - in the context of their model. The study benefits from multiple complementary methods - structural prediction using AlphaFold3, 2D and 3D analysis of (mainly negative stain) TEM images of several engineered kinesin constructs, biophysical characterisation of the complexes, peptide design, hydrogen/deuterium-exchange mass spectrometry, and simple cell-based imaging. Each set of experiments is thoughtfully designed, and the intrinsic limitations of each method are offset by other approaches such that the assembled data convincingly support the authors' conclusions. This study benefits from prior work by the authors on this system and the tools and constructs they previously accrued, as well as from other recent contributions to the field.

      Weaknesses:

      It is not always straightforward to follow the design logic of a particular set of experiments, with the result that the internal consistency of the data appears unconvincing in places.

      For example, i) the Figure 1 AlphaFold3 models do not include motor domains whereas nearly all of the rest of the data involve constructs with the motor domains;

      We appreciate the reviewer’s comment regarding the absence of the motor domains in the AlphaFold3 models shown in Figure 1. These domains were intentionally excluded to improve visual clarity and to better highlight the interaction between the TPR domains and CC1 in the inhibited kinesin-1 conformation. We felt that this simplified presentation in the main figure helps readers focus on the key mechanistic advance introduced in this work at the outset of the paper. For completeness, we have provided full-length kinesin-1 AlphaFold3 models that include the motor domains in the Supplementary Information (Fig. S1), and they are described in detail in the main text. In addition, we have added a note to the Figure 1 legend to explicitly direct readers to these full-length models.

      ii) the kinesin constructs are chemically cross-linked prior to TEM sample preparation - this is clear in the Methods but should be included in the Results text, together with some discussion of how this might influence consistency with other methods where crosslinking was not used.

      Thank you. Chemical crosslinking is typically important for obtaining high-quality negative-stain TEM grids of kinesin-1 complexes and has been employed in all prior EM studies by our group and others. While this was described in the Methods, we agree that it should also be stated explicitly in the Results. Accordingly, we have added a sentence to the Results section noting that the proteins were stabilized using the amine-to-amine crosslinker BS3 (“Proteins were also stabilised using the amine-to-amine crosslinker BS3 that was important for achieving reproducibly high-quality samples for imaging.”).

      Please see point below for acknowledgement of risks of using crosslinker.

      Can those cross-links themselves be used to probe the intramolecular interactions in the molecular populations by mass spec?

      We had considered this; however, cross-linking mass spectrometry (XL-MS) has been applied extensively to essentially identical kinesin-1 complexes by Tan et al. (eLife 2023). That work provided important insights into the overall architecture of the complex, including the new head–CC1 interactions. However, as fully acknowledged by the authors, significant ambiguity remained with respect to the positioning of the TPR domains, with many cross-links that could not be straightforwardly rationalized in a single model. These unresolved aspects provided part of the motivation for the present study, as highlighted in the Introduction.

      We believe that this ambiguity likely reflects an underlying conformational equilibrium of the kinesin-1 complex (e.g. opening/closing transitions) and/or dynamic docking and undocking of the TPR domains, together with lysine-rich features of the TPR domains (most notably the loops that connect the TPR alpha helices) that may make them prone to being locked in non-native states. Together, these factors limit the interpretability of static cross-linking data in this system. In this context, we feel that XL-MS has already been thoroughly explored for kinesin-1 and that its practical limitations in resolving these TPR interactions have been reached.

      This consideration was a primary motivation for pursuing cross-linker-free, solution-based approaches, particularly HDX-MS, which we argue provide the most relevant new insights into the assembly and conformational dynamics of the complex. To make this rationale clearer, we have added an explicit note in the HDX-MS section emphasizing that this is a cross-linker-free method. The added text reads:

      “To determine how the local structural changes from adaptor binding and shoulder dislocation affected the dynamics of kinesin-1 complexes in solution, as directly and least invasively as possible, and without the risk of cross-linker artefacts.”

      In general, the information content of some of the figure panels can also be improved with more annotations (e.g. angular relationship between views in Figure 1B, approximate interpretations of the various blobs in Fig 3F), and more thought could be given to what the reader should extract from the representative micrographs in several figures. Inclusion of the raw data is welcome, but extraction and magnification of exemplar particles (as is done more effectively in Fig S5) could convey more useful information elsewhere.

      We appreciate these suggestions. We have modified the figures throughout the manuscript in line with the reviewer’s points. Raw data is now provided at higher magnification throughout so the reader can better distinguish individual particles, angular relationships have been added and further annotations provided on 2D class averages. We do not want the reader to draw too many conclusions from images of single closed particles (with the exception of open vs closed in Fig S7) as these require averaging and 2D classification to obtain meaningful insights, and so we have not added zoom panels in these cases. Figure 3F has been annotated as requested.

      Reviewer #2 (Public review):

      Summary:

      In this paper, Shukla, Cross, Kish, and colleagues investigate how binding of a cargo-adaptor mimic (KinTag) to the TPR domains of the kinesin-1 light chain, or disruption of the TPR docking site (TDS) on the kinesin-1 heavy chain, triggers release of the TPR domains from the holoenzyme. This dislocation provides a plausible mechanism for transition out of the autoinhibited lambda-particle toward the open and active conformation of kinesin-1. Using a combination of negative-stain electron microscopy, AlphaFold modeling, biochemical assays, hydrogen-deuterium exchange mass spectrometry (HDX-MS), and other methods, the authors show how TPR undocking propagates conformational changes through the coiled-coil stalk to the motor domains, increasing their mobility and enhancing interactions with the microtubule-bound cofactor MAP7. Together, they propose a model in which the TDS on CC1 of the heavy chain forms a "shoulder" in the compact, autoinhibited state. Cargo-adaptor binding, mimicked here by KinTag, dislodges this shoulder, liberating the motor domains and promoting MAP7 association, driving kinesin-1 activation.

      Strengths:

      Throughout the study, the authors use a clever construct design - e.g., delta-Elbow, ElbowLock, CC-Di, and the high-affinity KinTag - to test specific mechanisms by directly perturbing structural contacts or affecting interactions. The proposed mechanism of releasing autoinhibition via adaptor-induced TPR undocking is also interrogated with a number of complementary techniques that converge on a convincing model for activation that can be further tested in future studies. The paper is well-written and easy to follow, though some more attention to figure labels and legends would improve the manuscript (detailed in recommendations for the authors).

      Weaknesses:

      These reflect limits of what the current data can establish rather than flaws in execution. It remains to be tested if the open state of kinesin-1 initiated by TPR undocking is indeed an active state of kinesin-1 capable of processive movement and/or cargo transport. It also remains to be determined what the mechanism of motor domain undocking from the autoinhibited conformation is, and perhaps this could have been explored more here. The authors have shown by HDX-MS that the motor domains become more mobile on KinTag binding, but perhaps molecular dynamics would also be useful for modelling how that might occur.

      We are grateful for the reviewer’s comments. We agree that the weaknesses the reviewer has outlined define the limitations of the study and establish important priorities for future work, including molecular dynamics simulations. An important prerequisite for the latter is a starting model that one has confidence in. We think that our study and earlier work now provide a good, experimentally supported foundation for using AF3-generated assemblies for this purpose, by ourselves and others.

      Reviewer #3 (Public review):

      Summary:

      The manuscript by Shukla and colleagues presents a comprehensive study that addresses a central question in kinesin-1 regulation - how cargo binding to the kinesin light chain (KLC) tetratricopeptide repeat (TPR) domains triggers activation of full-length kinesin-1 (KHC). The authors combine AlphaFold3 modeling, biophysical analysis (fluorescence polarization, hydrogen-deuterium exchange), and electron microscopy to derive a mechanistic model in which the KLC-TPR domains dock onto coiled-coil 1 (CC1) of the KHC to form the "TPR shoulder," stabilizing the autoinhibited (λ-particle) conformation. Binding of a W/Y-acidic cargo motif (KinTag) or deletion of the CC1 docking site (TDS) dislocates this shoulder, liberating the motor domains and enhancing accessibility to cofactors such as MAP7. The results link cargo recognition to allosteric structural transitions and present a unified model of kinesin-1 activation.

      Strengths:

      (1) The study addresses a fundamental and long-standing question in kinesin-1 regulation using a multidisciplinary approach that combines structural modeling, quantitative biophysics, and electron microscopy.

      (2) The mechanistic model linking cargo-induced dislocation of the TPR shoulder to activation of the motor complex is well supported by both structural and biochemical evidence.

      (3) The authors employ elegant protein-engineering strategies (e.g., ElbowLock and ΔTDS constructs) that enable direct testing of model predictions, providing clear mechanistic insight rather than purely correlative data.

      (4) The data are internally consistent and align well with previous studies on kinesin-1 regulation and MAP7-mediated activation, strengthening the overall conclusion.

      Weaknesses:

      (1) While the EM and HDX-MS analyses are informative, the conformational heterogeneity of the complex limits structural resolution, making some aspects of the model (e.g., stoichiometry or symmetry of TPR docking) indirect rather than directly visualized.

      We agree with the reviewer’s point. Conformational heterogeneity is a significant challenge, and the model has been developed from multiple complementary approaches. A higher resolution cryoEM study remains a priority but is challenging because of the size, shape and flexibility of the particle. We hope that some of the approaches used here (e.g. nanobody TPR stabilisation, ElbowLock) will provide a path to achieve this.

      (2) The dynamics of KLC-TPR docking and undocking remain incompletely defined; it is unclear whether both TPR domains engage CC1 simultaneously or in an alternating fashion.

      We agree that this is a limitation. We strongly suspect that the TPR domains are dynamic, and we are working to overcome experimental challenges to resolve this important outstanding question. We have expanded the discussion section to better highlight this important priority.

      (3) The interplay between cargo adaptors and MAP7 is discussed but not experimentally explored, leaving open questions about the sequence and exclusivity of their interactions with CC1.

      We agree that this is a limitation, and addressing it will be an important priority for future studies.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      There are a number of places where the text could be more precise or clear, or the figures could be designed to be more informative:

      (1) The word "unitarily" is used in several places, and I don't know what it means in this context.

      We have changed the phrasing throughout the manuscript to avoid this term. We were attempting to contrast with presumed cooperative multivalent interactions in the context of the kinesin-1 tetramer but agree that this choice of word doesn’t quite achieve that.

      (2) On page 5 the phrase "We focused on the ElbowLock background" is introduced and needs to be explained more clearly.

      Thank you. We have amended the text to read “This KIF5C construct contains a short 5 amino acid deletion that restricts flexibility around the elbow and helps maintain particles in their lambda conformation, providing homogenous samples, and facilitating subsequent analysis (34).”

      (3) On page 6, the phrase "To improve the resolution of our images, we turned to single-particle cryoEM analysis" is imprecise - what do the authors mean by the resolution of the images? Cryo-EM data does not always guarantee a higher resolution structure, but it offers the possibility of visualising finer structural features. This is probably what is meant here, but needs to be stated more precisely.

      We have amended the text to ‘visualise finer structural details’ as suggested.

      (4) Page 7 - "suggesting that TPR domains had loosely dissociated from the core" - I don't think the evidence points to dissociation of KLCs from the complex, but the phrase "loosely dissociated" implies this - would benefit from rephrasing.

      We have changed this to ‘undocked’ for consistency with other descriptions in the manuscript.

      (5) Was the effect of the CC-Di insertion (ΔTDS) detectable by AlphaFold prediction? It would be interesting to include this, partly for completeness and partly because a slightly imperfect and maybe a more dynamic coiled-coil in this region of the molecule may be important in supporting the conformational changes required for activation.

      Thank you for this suggestion. Modelling of the deltaTDS complex indeed shows displacement of the TPR domains. In the standard 5 output models, the TPR domains now occupy a variety of different positions, all with essentially zero confidence (high position error). Consistent with biochemical data, and as expected, the CCDi insertion is modelled with no overall disruption to the architecture or length of CC1. We think that this is a valuable addition to the study and have included it as a new supplementary figure (Fig S5), with the main text reading:

      …. “Supporting this, models of ΔTDS complexes using AF3 showed the expected seamless insertion of CCDi into CC1, with displacement of the TPR domains to a variety of different positions, in 5 models, all with high position error with respect to KHC (Fig S5).”

      (6) Figure S1 has two sections designated (C) in the legend.

      Corrected

      (7) Figure S3 - given the resolution and level of interpretation of the 3D reconstructions, it is not relevant to include an FSC curve, but other standard information, such as angular distribution and any evidence of variability from 3D classifications (and how many particles per 3D class) should be included for all structures.

      Thank you, a complete workflow for all complexes has now been provided in Figure S8 with the information requested. In each case there were typically two ‘good’ classes. For ElbowLock, this included one without a prominent shoulder, consistent with 2D classification and quantification. We assume this may reflect a docking/undocking equilibrium. For the deltaTDS and KinTag particles, neither class showed the shoulder feature. The main text has been modified to reflect this and reads “For ElbowLock complexes, this resulted in classes with and without a prominent shoulder, in agreement with 2D classification. For ElbowLock-ΔTDS and ElbowLock-KinTag complexes, no prominent shoulder containing classes were observed.”

      Reviewer #2 (Recommendations for the authors):

      Overall, the figures would benefit from more labels for clarity, some examples and suggestions below:

      (1) Figure 1A - Connect motors to the rest of the structure e.g., wiggly lines.

      Corrected.

      (2) Figure 1B - Add arrows and angles to indicate different views of the model.

      Corrected.

      (3) Figure 1B - Label TPR1-6 (e.g., inset zoom in).

      Corrected.

      (4) Figure 2D and 3D - Label the lack of a shoulder in all averages (perhaps with an arrow instead of a circle to not obscure density), include an example average which shows prominent shoulder density.

      Corrected. Full sets of classes showing shoulder-like features for deltaTDS and KinTag complexes are now shown in Figure S4.

      (5) Figure 3D: Label motor domains and elbow as in other figures.

      Corrected.

      (6) Methods: Include more information on how EM classes were compared to AF projections (e.g., Figure 1D). Was this done visually or computationally? Likewise, more information is needed on how classes were judged to have prominent/weak shoulder density (Figure 2D). In the figure legend, there is a statement that "Full sets of classes are provided in Fig. S4" but this is absent in the supplement.

      Thank you. This information has been added to the methods.

      “For comparison to the AF3 model, simulated density was generated using the molmap command in ChimeraX (73) filtering to 15 Å, and projections were generated/selected automatically using the Reference Based Auto Selected 2D function in CryoSPARC”.

      Full sets of classes are now provided in Figure S4.

      (7) Figure 1-3 - Raw micrographs are a very useful inclusion but would benefit from being a more zoomed-in view (e.g., Figure S5 scale). Particularly useful for 3C, where the mixture of open and closed would be good to see.

      Higher zoom micrographs have been provided throughout.

      (8) Figure 5D: Panels too small to see the result, suggest making full width and moving E below.

      Thank you. We have expanded the panel and moved the model to a new Figure 6.

      (9) Figure S1: PAE plot convincing, but pLDDT colour models needed.

      A representative model coloured for pLDDT has been added to Figure S1. Most of the structure sits within the light blue confident range (90 > pLDDT > 70) with the exception of the disordered regions and neck coil.

      (10) Figure 5B: Reason for the variable inputs?

      The reviewer raises an interesting point. The slightly reduced expression of deltaElbow and slightly increased expression of ElbowLock is a consistent feature of these experiments. We note that this effect is in the ‘opposite direction’ to the impact on binding to MAP7 and so does not affect our conclusions from the experiment. However, we wonder whether opening and closing of the complex may impact on turnover of kinesin proteins, which could have implications for their normal homeostasis and possible degradation after transport in polarised cells. We are considering how to explore this going forwards. We have added a note to the results section to highlight this interesting observation to the reader.

      “We also noted slightly elevated expression of ElbowLock complexes and slightly lower expression of DeltaElbow complexes, suggesting that opening/closing of the complex could impact on kinesin-1 turnover”

      (11) Figure legend 5B: Insufficient detail, the end result is stated, but the three separate gels are not described.

      Legend has been expanded.

      (12) Figure 3F: Currently somewhat problematic. It is unclear if the models are in the same view, and so comparison is difficult. Figure 1C (bottom right) shows class averages with a clear, separate CC density, so the relatively featureless model in this region is puzzling. A statement on how the three model views are related to each other, if aligned with each other, would be useful.

      We appreciate the reviewer’s point. Models were aligned in Chimera, using the fit in map command. Because of the limited features of the models, presumably due to flexibility, achieving a good alignment for all three models was challenging, but we think that showing the 180-degree rotations is probably about the best we can achieve here.

      (13) The following statement is too strong: "Nonetheless, we obtained reference-free 2D class averages that appeared to show full-length 'side' views of the complex with clear definition of the elbow, hinge 2, and KHC-KLC (coiled-coil) interface features which enabled us to identify CC1 confidently (Fig. 1D)". Given that the negative-stain EM data were collected primarily to validate the AlphaFold model, the assignment of CC1 should be described as consistent with rather than confidently identified from the class averages. The resolution of the EM data does not independently support such an assignment, and the wording needs to be softened.

      We appreciate the reviewer’s point and have softened the wording as suggested. The paragraph now reads:

      “To visualise finer structural details, we turned to single-particle cryoEM analysis of frozen-hydrated samples. We were unable to obtain optimal samples suitable for determining the complete structure. Nonetheless, we obtained reference-free 2D class averages that appeared to show full-length ‘side’ views of the complex with clear definition of the elbow, hinge 2, and KHC-KLC (coiled-coil) interface features (Fig. 1D). The motor domains were poorly resolved in these classes, suggesting that the head assembly is somewhat flexible relative to the coiled coil/TPR body. A comparison to low-pass filtered back-projections from the AF3 model (without motor domains) revealed density at a position concurrent with the docked TPR domains (Fig. 1D).”

      (14) There is a typo in the figure legend of Figure 3 - (E) and (F) should be (F) and (G).

      Corrected

      Reviewer #3 (Recommendations for the authors):

      I recommend the following additions:

      (1) Figure 1 labeling - In panel A, please label the "linker domain" and the "KLC subunits" explicitly to help orient the reader. In panel B, please mark the "TPR shoulder" corresponding to the docked TPR domains on CC1; this will help the reader connect parts B and C.

      Thank you, we have modified Figure 1A with this additional information.

      (2) The TPR docking site (TDS) is a central structural element, and its sequence boundaries are provided in the Methods. It would help to visualize this directly in Figure 2A or in an inset.

      We hope that the reviewer agrees that the zoomed in model in Figure 5A (alongside MAP7) provides a sufficiently detailed view of the structural interface to highlight the orientation of TPR1 with respect to CC1. The side chain contacts in the model are very plausible and confidently predicted (and can be straightforwardly reproduced in AF3 using the sequence information provided in the methods), but as our study has not explored this interaction at the single residue level, we would prefer not to imply this to the reader at this stage.

      (3) The authors' model of cargo-induced TPR dislocation is convincing. However, the Discussion could benefit from a clarification on whether both KLC-TPR domains are expected to be bound simultaneously or if a dynamic exchange occurs, as the EM data suggest potential asymmetry.

      Thank you, please see point 5 below where we have modified the discussion to reflect the reviewer’s thoughtful comments.

      (4) The HDX-MS analysis is comprehensive, but the authors may want to briefly comment on the coverage of low-signal regions (especially within CC2-CC3) to enhance clarity.

      We have added an additional supplementary figure (S10) showing sequence coverage. Overall, this is 88% but with some lower coverage around KHC-CC0 (neck) and the acidic linker that connects the KLC coiled-coil to the TPR. We have added a note to the main text to reflect this.

      “Sequence coverage was high (overall 88%) with the exception of KHC-CC0 (neck coil) and the acidic-linker region that connects the KLC coiled-coil to the TPR domains where coverage was lower”

      (5) In the Discussion, the proposed interplay between MAP7 and cargo adaptors is intriguing, especially considering the results from Anna Akhmanova's lab showing that MAP7 activates kinesin-1 processivity. Do the authors suggest that competition for CC1 is mutually exclusive or sequential? The answer has mechanistic implications.

      We have been considering these questions for some time, and the short answer is that we don’t fully understand the dynamics yet. However, we appreciate the reviewer’s prompt to clarify our thinking on this. We have attempted to do this in a revised discussion section where we more explicitly outline these outstanding questions.

    1. eLife assessment

      This manuscript provides an important contribution to the field of platelet biogenesis, and the convincing evidence will advance our understanding of signal transduction driving the development of late megakaryopoiesis and platelet reactivity that results in bleeding diathesis. The paper is noteworthy for analyzing two related tyrosine phosphatases, either singly or in combination, in this conditional, developmental stage-specific gene knockout. Because SHP1 is a negative regulator and SHP2 is an activator, the synergistic effects found in the double knockout were surprising.

    2. Reviewer #1 (Public review):

      Barré et al. investigated the role of Shp1 and Shp2 in megakaryocytes (MKs) and platelets by conditional knock-out of Shp1, Shp2, or both under the control of the Gp1ba promoter. Deletion of Shp1 and Shp2 in MKs and platelets was almost complete. The Shp1/Shp2 double knock-out mice displayed macrothrombocytopenia and increased bleeding, whereas the single knock-outs did not show significant defects. Platelet function was aberrant in DKOs, but not in single knock-outs, and so was ligand-induced signaling, particularly Syk phosphorylation.

      Megakaryocyte maturation was impaired in Shp1/Shp2 DKO mice. Ligand-induced signaling was impaired in Shp2 knock-out and DKO. Ex vivo formation of platelets and in vivo maturation of MKs were impaired in DKO mice. Pharmacological inhibitors of Shp1 and Shp2 had largely similar effects as observed in the single knock-outs. The authors conclude that Shp1 and Shp2 have synergistic functions in the MK/platelet lineage, and that Shp2 may be a potential therapeutic target in myeloproliferative neoplasms.

      Strengths:

      The data clearly show effects of the Shp1/Shp2 double knock-out on MKs and platelets.

      Weaknesses:

      There appears to be a discrepancy between the results with the Shp2 single knock-out and the Shp2 inhibitor: the Shp2 knock-out does not affect MKs and platelets, except Erk1/2 signaling, whereas the Shp2 inhibitors appear to affect MK function.

      This work is interesting and may have potential from a therapeutic point of view.

    3. Reviewer #2 (Public review):

      Summary:

      In this manuscript, Barré et al. investigate the roles of the phosphatases Shp1 and Shp2 in the megakaryocyte and platelet lineage using genetic depletion in mice. By employing Gp1ba-Cre-based models, the study builds on the authors' previous work and addresses some limitations associated with earlier Pf4-Cre approaches. The authors report relatively mild alterations in megakaryocyte and platelet parameters in mice lacking either Shp1 or Shp2 alone, whereas combined deletion of both phosphatases results in macrothrombocytopenia, mild bleeding, and impaired GPVI-dependent platelet aggregation accompanied by reduced Syk phosphorylation. The functional platelet defects are linked to reduced expression of GPVI and integrin α2, while thrombocytopenia is associated with impaired megakaryocyte maturation, reduced ploidy, defective proplatelet formation, and altered TPO-dependent Ras/MAPK signaling. Similar effects on megakaryopoiesis are also observed in vitro following treatment with newly developed Shp2 inhibitors.

      Strengths and Weaknesses:

      The study addresses an important biological question and presents a substantial dataset that could contribute to a better understanding of Shp1 and Shp2 function in platelet biology. However, several aspects of data presentation and interpretation would benefit from additional clarification. In particular, while the authors conclude that single genetic deletion or pharmacological inhibition of Shp1 has a limited impact and that the major phenotypes are specific to combined Shp1/2 deletion or Shp2 inhibition, some of the data suggest more nuanced effects that may warrant further discussion.

    4. Reviewer #3 (Public review):

      Summary:

      In this manuscript, Barré et al utilize the Gp1ba-Cre transgenic mouse model to build upon previous findings in a Pf4-Cre system to investigate the effects of individual and combined Shp1 and Shp2 deletion in megakaryocytes and platelets. They report decreased megakaryocyte maturation, macrothrombocytopenia, and increased bleeding primarily in association with the Shp1/Shp2 double-knockout condition. The authors further show that this phenotype appears to be driven primarily by Shp2 and implicate dysregulation of Mpl signaling and downstream Ras/MAPK pathways, including ERK1/2. Given the key role of these pathways in human diseases such as myeloproliferative neoplasms and the challenges associated with modulating such a central pathway, identification of a specific regulator of Mpl signaling poses intriguing questions for future studies on clinical applicability.

      Strengths:

      Overall, the experiments combine in vitro, in vivo, and ex vivo approaches and appear to have been carefully designed and carried out, with multiple technical and biological replicates where relevant. The authors make a compelling argument for using the Gp1ba-Cre as opposed to the Pf4-Cre system and demonstrate both the dose- and stage-dependent effects of Shp1 and Shp2 on megakaryopoiesis and thrombopoiesis. They find that Shp1 and Shp2 are required in late-stage megakaryocyte maturation and that even low levels of expression compared to baseline are likely sufficient to yield generally normal megakaryocytes. Their findings also lead to specific future directions, such as the mechanism by which Shp1 regulates megakaryopoiesis and thrombopoiesis that is distinct from TPO-mediated signaling.

      Weaknesses:

      While the experiments have been thoughtfully designed and carried out, there is limited background explanation on relatively complex or niche pathways/mechanisms, such as the relationship between P-selectin, CRP, and PAR4p; the interactions between SFK, Syk, GPVI, and CLEC-2; and TPO, MPL, ERK1/2, AKT, and STAT3, which, while likely intuitive to experts in their respective fields, may be less obvious to a reader approaching this manuscript with a global interest in megakaryopoiesis/thrombopoiesis and thus detract from the impact of the findings.

      With regard to the science itself, some of the conclusions feel premature based on the available data.

      (1) The section "Aberrant ITAM signaling in Shp1- and Shp2-deficient platelets" is challenging to follow for those not well-versed in ITAM signaling and associated pathways, and may take additional outside reading to follow the conclusion that Syk-dependent signaling is modulated downstream of GPVI and CLEC-2 based on lack of change in Src p-Tyr418, especially considering that Src p-Tyr418 was previously introduced as a measure of SFK rather than Syk. In the introduction, Shp1 is specifically mentioned as a negative regulator of the ITAM/Syk/phospholipase pathway. However, in Figure 4Ai and Bi, Syk phosphorylation/activation in Shp1 knockout cells did not appear to be different from Shp2 knockout cells, and is lower than the control, which is surprising for a negative regulator. It is also not clear why, in the section (Figure 4A-B), there is reduced Syk activation in Shp1 and Shp2 single knockout cells upon CLEC2 stimulation (but apparently not with CRP) when there was no difference in response to CLEC2 (but a difference in response to CRP) in the previous section (Figure 3A, C).

      (2) In the section "Reduced Tpo signaling in Shp1/2-deficient MKs," only Western blot data for (p)ERK1/2, AKT, and STAT3 are presented before concluding that decreased ERK1/2 activity is a mechanistic explanation for thrombocytopenia seen in the Shp1/2 double-knockout condition. Such a statement would benefit from additional experiments, such as protein or transcriptional levels of ERK1/2 targets specifically relevant to megakaryopoiesis, such as ETS, FOS, and JUN, to assess the consequences of decreased phosphorylated ERK1/2.

      (3) Suggesting that "inhibiting Shp2 will not hav[e] any bleeding consequence in patients" and that Shp2 may be a therapeutic target in myeloproliferative neoplasms when none of these studies have been carried out in a human model is a bold conclusion. There are no data presented on, for example, whether Shp2 inhibition can help reverse the MPL/JAK/STAT pathway in the setting of gain-of-function mutations specifically associated with myeloproliferative neoplasms.

    5. Author response:

      eLife Assessment

      This manuscript provides an important contribution to the field of platelet biogenesis, and the convincing evidence will advance our understanding of signal transduction driving the development of late megakaryopoiesis and platelet reactivity that results in bleeding diathesis. The paper is noteworthy for analyzing two related, either singly or in combination, tyrosine phosphatases in this conditional, stage development gene knockout. Because SHP1 is a negative regulator and SHP2 is an activator, the synergistic effects found in the double knockout were surprising.

      We thank the reviewer for acknowledging the importance and novelty of our findings.

      Public Reviews:

      Reviewer #1 (Public review):

      Barré et al. investigated the role of Shp1 and Shp2 in megakaryocytes (MKs) and platelets by conditional knock-out of Shp1, Shp2, or both under the control of the Gp1ba promoter. Deletion of Shp1 and Shp2 in MKs and platelets was almost complete. The Shp1/Shp2 double knock-out mice displayed macrothrombocytopenia and increased bleeding, whereas the single knock-outs did not show significant defects. Platelet function was aberrant in DKOs, but not in single knock-outs, and so was ligand-induced signaling, particularly Syk phosphorylation.

      Megakaryocyte maturation was impaired in Shp1/Shp2 DKO mice. Ligand-induced signaling was impaired in Shp2 knock-out and DKO. Ex vivo formation of platelets and in vivo maturation of MKs were impaired in DKO mice. Pharmacological inhibitors of Shp1 and Shp2 had largely similar effects as observed in the single knock-outs. The authors conclude that Shp1 and Shp2 have synergistic functions in the MK/platelet lineage, and that Shp2 may be a potential therapeutic target in myeloproliferative neoplasms.

      Strengths:

      The data clearly show effects of the Shp1/Shp2 double knock-out on MKs and platelets.

      Weaknesses:

      There appears to be a discrepancy between the results with the Shp2 single knock-out and the Shp2 inhibitor: the Shp2 knock-out does not affect MKs and platelets, except Erk1/2 signaling, whereas the Shp2 inhibitors appear to affect MK function.

      This work is interesting and may have potential from a therapeutic point of view.

      Pharmacological effects do not always correlate with congenital anomalies arising from genetic defects. The Shp2 allosteric inhibitors used in our study only inhibit catalytically inactive Shp2, whereas targeted deletion of Ptpn11 results in a loss of total Shp2 expression, including catalytic and non-catalytic related functions, with developmental consequences. Further, Gp1ba-Cre+;Shp2fl/fl megakaryocytes express approximately 22% of normal Shp2 levels, which likely also contributes to differences observed between pharmacological inhibition and genetic ablation of Shp2.

      We thank the reviewer for recognizing the therapeutic potential of our findings.

      Reviewer #2 (Public review):

      Summary:

      In this manuscript, Barré et al. investigate the roles of the phosphatases Shp1 and Shp2 in the megakaryocyte and platelet lineage using genetic depletion in mice. By employing Gp1ba-Cre-based models, the study builds on the authors' previous work and addresses some limitations associated with earlier Pf4-Cre approaches. The authors report relatively mild alterations in megakaryocyte and platelet parameters in mice lacking either Shp1 or Shp2 alone, whereas combined deletion of both phosphatases results in macrothrombocytopenia, mild bleeding, and impaired GPVI-dependent platelet aggregation accompanied by reduced Syk phosphorylation. The functional platelet defects are linked to reduced expression of GPVI and integrin α2, while thrombocytopenia is associated with impaired megakaryocyte maturation, reduced ploidy, defective proplatelet formation, and altered TPO-dependent Ras/MAPK signaling. Similar effects on megakaryopoiesis are also observed in vitro following treatment with newly developed Shp2 inhibitors.

      Strengths and Weaknesses:

      The study addresses an important biological question and presents a substantial dataset that could contribute to a better understanding of Shp1 and Shp2 function in platelet biology. However, several aspects of data presentation and interpretation would benefit from additional clarification. In particular, while the authors conclude that single genetic deletion or pharmacological inhibition of Shp1 has a limited impact and that the major phenotypes are specific to combined Shp1/2 deletion or Shp2 inhibition, some of the data suggest more nuanced effects that may warrant further discussion.

      We thank the reviewer for raising this point. The manuscript is being revised accordingly, including highlighting the potential role of Shp1 in megakaryopoiesis and thrombopoiesis under steady-state and stressed conditions, requiring more detailed investigation.

      Reviewer #3 (Public review):

      Summary:

      In this manuscript, Barré et al utilize the Gp1ba-Cre transgenic mouse model to build upon previous findings in a Pf4-Cre system to investigate the effects of individual and combined Shp1 and Shp2 deletion in megakaryocytes and platelets. They report decreased megakaryocyte maturation, macrothrombocytopenia, and increased bleeding primarily in association with the Shp1/Shp2 double-knockout condition. The authors further show that this phenotype appears to be driven primarily by Shp2 and implicate dysregulation of Mpl signaling and downstream Ras/MAPK pathways, including ERK1/2. Given the key role of these pathways in human diseases such as myeloproliferative neoplasms and the challenges associated with modulating such a central pathway, identification of a specific regulator of Mpl signaling poses intriguing questions for future studies on clinical applicability.

      We thank the reviewer for acknowledging the importance and novelty of our findings.

      Strengths:

      Overall, the experiments combine in vitro, in vivo, and ex vivo approaches and appear to have been carefully designed and carried out, with multiple technical and biological replicates where relevant. The authors make a compelling argument for using the Gp1ba-Cre as opposed to the Pf4-Cre system and demonstrate both the dose- and stage-dependent effects of Shp1 and Shp2 on megakaryopoiesis and thrombopoiesis. They find that Shp1 and Shp2 are required in late-stage megakaryocyte maturation and that even low levels of expression compared to baseline are likely sufficient to yield generally normal megakaryocytes. Their findings also lead to specific future directions, such as the mechanism by which Shp1 regulates megakaryopoiesis and thrombopoiesis that is distinct from TPO-mediated signaling.

      Weaknesses:

      While the experiments have been thoughtfully designed and carried out, there is limited background explanation on relatively complex or niche pathways/mechanisms, such as the relationship between P-selectin, CRP, and PAR4p; the interactions between SFK, Syk, GPVI, and CLEC-2; and TPO, MPL, ERK1/2, AKT, and STAT3, which, while likely intuitive to experts in their respective fields, may be less obvious to a reader approaching this manuscript with a global interest in megakaryopoiesis/thrombopoiesis and thus detract from the impact of the findings.

      We thank the reviewer for raising this point. The manuscript is being revised to better explain the rationale and molecular mechanisms linking these pathways and functions.

      With regard to the science itself, some of the conclusions feel premature based on the available data.

      (1) The section "Aberrant ITAM signaling in Shp1- and Shp2-deficient platelets" is challenging to follow for those not well-versed in ITAM signaling and associated pathways, and may take additional outside reading to follow the conclusion that Syk-dependent signaling is modulated downstream of GPVI and CLEC-2 based on lack of change in Src p-Tyr418, especially considering that Src p-Tyr418 was previously introduced as a measure of SFK rather than Syk. In the introduction, Shp1 is specifically mentioned as a negative regulator of the ITAM/Syk/phospholipase pathway. However, in Figure 4Ai and Bi, Syk phosphorylation/activation in Shp1 knockout cells did not appear to be different from Shp2 knockout cells, and is lower than the control, which is surprising for a negative regulator. It is also not clear why, in the section (Figure 4A-B), there is reduced Syk activation in Shp1 and Shp2 single knockout cells upon CLEC2 stimulation (but apparently not with CRP) when there was no difference in response to CLEC2 (but a difference in response to CRP) in the previous section (Figure 3A, C).

      We thank the reviewer for raising these important points. The manuscript is being revised accordingly, including clarifying the roles of SFKs, Shp1 and Shp2 in the ITAM-Syk-PLCγ2 signaling pathway.

      Briefly, SFKs are essential for phosphorylating ITAMs, allowing SH2-dependent docking of Syk. Reduced reactivity of Shp1/2 DKO platelets to CRP and collagen is likely due to downregulation of the ITAM-containing GPVI-FcR γ-chain complex and integrin α2 subunit, and concomitant reduction in Syk phosphorylation.

      However, the cause of the marginal, albeit significant, reduction in Syk phosphorylation downstream of CLEC-2 in Shp1 and Shp2 KO platelets was not determined, and the reduction was insufficient to impact CLEC-2-mediated platelet aggregation under the conditions tested.

      Differences in the stoichiometry and docking of Syk to phosphorylated GPVI-FcR γ-chain and CLEC-2 likely contribute to the differences in platelet reactivity and Syk phosphorylation downstream of the two receptors in the absence of Shp1 and Shp2.

      (2) In the section "Reduced Tpo signaling in Shp1/2-deficient MKs," only Western blot data for (p)ERK1/2, AKT, and STAT3 are presented before concluding that decreased ERK1/2 activity is a mechanistic explanation for thrombocytopenia seen in the Shp1/2 double-knockout condition. Such a statement would benefit from additional experiments, such as protein or transcriptional levels of ERK1/2 targets specifically relevant to megakaryopoiesis, such as ETS, FOS, and JUN, to assess the consequences of decreased phosphorylated ERK1/2.

      We thank the reviewers for these constructive comments. Further experiments are being planned to determine the biological and transcriptional consequences of reduced ERK1/2 phosphorylation during megakaryopoiesis and thrombopoiesis.

      (3) Suggesting that "inhibiting Shp2 will not have any bleeding consequence in patients" and that Shp2 may be a therapeutic target in myeloproliferative neoplasms when none of these studies have been carried out in a human model is a bold conclusion. There are no data presented on, for example, whether Shp2 inhibition can help reverse the MPL/JAK/STAT pathway in the setting of gain-of-function mutations specifically associated with myeloproliferative neoplasms.

      This conclusion is being tempered in the revised manuscript. Genetic- and pharmacological-based approaches will be used to establish the therapeutic potential of inhibiting Shp1 and Shp2 in mouse models of MPN, including Jak2 gain-of-function mice. Bleeding and thrombotic complications of inhibiting Shp1 and Shp2 will be explored as part of these studies.

    1. eLife Assessment

      This study provides valuable findings in the study of enhancer biology by identifying and dissecting a minimal enhancer regulating dlx2b expression during zebrafish tooth development, supported by promoter dissection, reporter assays, and genome-editing approaches. The work offers a resource and extends previous findings but has limited broader impact, with several conclusions about general cis-regulatory principles and functional consequences remaining only partially supported. Accordingly, the strength of evidence is at present incomplete, as additional functional validation would be needed to fully substantiate some of the claims.

    2. Reviewer #1 (Public review):

      Summary:

      Jackman et al report the analysis of a cis-regulatory region upstream of the dlx2b gene in zebrafish, that is hypothesised to control gene expression in the developing tooth. To demonstrate this, the authors performed solid promoter bashing analysis to assess the gene expression driven by the regulatory region, and validated the expression against a GFP-reporter knock-in. They narrowed down the tooth-specific enhancer activity to the MTE, which was sufficient to drive gene expression. Interestingly, they have identified a vertebrate conserved region which contained four predicted transcription factor binding sites, which when mutated individually, did not alter the reported gene expression. However, in combination, the expression was disrupted. The authors propose a putative upstream regulator cebpa binding one of the predicted TFBS, using in situ hybridisation to show overlapping gene expression domains.

      Strengths:

      The experiments presented in this paper were rigorously executed and the authors' effort to systematically dissect the different elements of the enhancer are commendable. The discussion and limitations of the study were very well-balanced.

      First, the results represent important findings for the enhancer biology field, providing evidence for the role of redundant TFBSs. Too often, only TFBS mutations that are sufficient and necessary to drive gene expression patterns are reported, whereas work providing evidence that some TFBSs are necessary but not sufficient by themselves to drive expression is rarer. TFBS redundancy is a crucial concept in enhancer biology, but it is also difficult to prove, which hinders the accurate prediction of enhancer function. In an era where increasingly powerful machine learning models are developed to predict enhancer function, this work is a reminder of the complexity of enhancer biology and provides ground truths for experimental validation.

      Second, the results present valuable findings for the field of tooth development. While the authors have comprehensively described work performed in this space, there are still not many tooth-specific enhancers identified and accurately described. The work also presents further avenues for studying upstream regulators.

      Weaknesses:

      It seems to me that one of the greatest outcomes of this work is demonstrating the collective action of mutated TFBSs where individual mutations are not affecting gene expression. These findings fall into the realm of enhancer redundancy but this concept was not thoroughly discussed in the introduction of the paper.

      The claimed results are generally well-supported by the experiments performed, and hypotheses and speculations have been clearly stated. However, some speculative statements remain that should be addressed, for example in the abstract line 33 "These findings suggest that loss of MTE function permits alternative cis-regulatory elements to gain control of the promoter". There is no data indicating what these cis-regulatory elements could be, hence this sentence might be better suited to the discussion.

      The manuscript could be strengthened by further exploration of the wider region upstream of dlx2b to support the recruitment of other TFBSs: Were there any other vertebrate-conserved regulatory regions just outside of the MTE? Were there any other family members of the predicted TFs expressed in the tooth? Transcription factor binding sites identity remains a prediction; it could be expanded to other TFs within the same family.

    3. Reviewer #2 (Public review):

      The manuscript by Jackman et al. explores the role of a candidate enhancer of dlx2b in zebrafish tooth formation.

      They have mapped the dental epithelium and mesenchyme activity of a 4kb promoter proximal region previously identified as a candidate enhancer region. They identified candidate TFBS and candidate transcription factors regulating this enhancer and proposed that their findings reveal principles of enhancer function during vertebrate organogenesis (tooth development) and the power of dissecting cis regulatory architecture. The study offers a valuable genetic tagging resource for studying tooth development, while further experiments and analyses would be needed to support the suggestion of novel findings on cis-regulatory principles of tooth development. In the absence of functional evidence on the endogenous target gene or tooth development, some of the claims of the paper may need rephrasing.

      (1) The candidate enhancer region has previously been published, this study narrows the enhancer effect to a well-conserved region within. To what degree the element is unique in the locus for tooth development and to what degree this element is required for tooth morphogenesis, is not addressed.

      (2) The knock-in approach is convenient for reporter activity based analyses, however it lacks the precision that would be necessary to conclude on enhancer-autonomous effects or enhancer effects on the endogenous target promoter. The HSP promoter inserted within a 5kb(?) insert in the UTR region of dlx2b creates a chimeric E-P context. The expression profile of the knock-in reporter is substantially different from the endogenous gene (Figure 1B and C), suggesting an E-P interaction-dependent expression profile, which may confuse what in the expression comes solely from the enhancer and not as a result of the HSP promoter interaction with the enhancer. An alternative heterologous promoter would help in defining the enhancer-specific effects.

      (3) Function of the candidate enhancer: The MTE enhancer effect is measured by gain of function towards dlx2b regulation. The deletion assays are limited to plasmids designed to test the enhancer in isolation from the endogenous enhancer architecture, or to a deletion in the knock-in, which may be impacted by the chimeric regulatory interaction with a heterologous HSP promoter. As a result, we do not learn whether the enhancer targets, or is needed for, endogenous target gene activity. This design allows a conclusion on tissue activity of the enhancer but not on its requirement for tooth development.

      (4) Since the locus is scattered with candidate enhancers (see genome annotation resources) it is feasible that additional E-P interactions lead to potential enhancer redundancies with the MTE. For a conclusive functional test/requirement of the MTE enhancer, the authors would need to delete it in the endogenous locus context. The knock-in could theoretically be used for an enhancer function on dlx2b activity, if the authors show that there is interaction with the endogenous promoter (3C type experiment); and that the MTE enhancer-driven GFP activity was identical to the endogenous tagged dlx2b activity. This does not appear to be the case, as ectopic expression in Fig 1C as compared to B is shown. Of note, RNA detection by WISH would be more precise for comparisons. The figure likely compares protein (legend is unclear, but text suggests protein) to mRNA, which is imprecise.

      (5) There is an experimental design question arising with generating the MTE deletion in the knock-in (line 391): the authors describe generating the transgenic lines by screening for reduced reporter activity first. This suggests the authors pre-emptively looked for an effect as the result they predicted when generating the transgenic lines, which would create a circular argument. All transgenic lines carrying the deletion (tested by sequencing first) would need to be assayed for activity change, and only then could a conclusion be made on the effect of MTE loss by statistical analyses of reporter activities in the generated lines.

      (6) Most transgenic work described is based on single transgenic lines. Enhancer-promoter contexts may be affected either by position effects (in the case of the reporter constructs) or by the heterologous promoter context of the knock-in, which may be affected by unexpected recombination events. Such unintended confounding effects can be excluded by replicates.

      (7) GFP protein detection does not allow precise spatio-temporal resolution due to varying protein stability in tissues, which potentially impacts endogenous gene activity comparison, and accurate determination of activity dynamics towards conclusions on lineage determining/maintenance roles of the dlx2b enhancer.

      (8) The expression pattern change upon MTE loss (retention of mesenchyme, loss of epithelium) is an interesting observation, which would benefit from more comprehensive analysis of the grammar (TFBS contributions) to the pattern variation by dissection of the combination of TFBS contributions. Without such, enhancer grammar remains mostly unclear, thus, principles of morphogenesis may not have been uncovered.

      (9) The diagrammatic models of the conclusions are illustrating simple logic which does not add to the text.

      (10) Author contributions need to be explained in more detail to be sufficiently granular for fair credit.

    4. Reviewer #3 (Public review):

      In the manuscript entitled "A Minimal tooth Enhancer Regulates dlx2b Expression During Zebrafish Tooth Formation: Insights into Cis-Regulatory Logic in Organogenesis", the authors explore the cis-regulatory logic of a dlx2b minimal enhancer capable of directing dlx2b gene expression to the developing tooth germs. The study combines (1) CRISPR-mediated GFP knock-in to track endogenous gene expression; (2) a promoter-bashing approach to identify a minimal tooth enhancer (MTE); (3) site-directed mutagenesis coupled with transgenesis to assess the individual role of conserved TF binding sites; and (4) in vivo deletion of the MTE to examine the consequences for gene expression. Overall, this is a technically solid study that provides some novel insights into tooth development and extends previous observations by the authors (Jackman & Stock, 2006; PNAS). However, the added value of the manuscript is limited by both the narrow experimental scope and the relatively modest impact of the findings for the broader field of developmental biology.

      Main concerns:

      (1) My main concern is that the study restricts the search for cis-regulatory information to the 5' region 4kb upstream of the TSS of the gene, rather than encompassing the full genomic locus. This is particularly limiting given that a knock-in allele was generated, which in principle allows interrogation of regulatory elements across the entire locus, and that the authors acknowledge the availability of genome-wide regulatory datasets (e.g. DANIO-CODE) in the Discussion. Despite this, no systematic effort is made to test additional regulatory elements beyond the proximal promoter/enhancers.

      This has important implications for the interpretation of the current work as: (a) dlx2b, like many developmental genes, resides in a gene desert enriched in open chromatin regions that may function as distal enhancers, and (b) the deletion of the MTE unmasked a cis-regulatory activity whose nature cannot be explained with the information provided, and that may seem relevant for the expression of the gene in the dental mesenchyme.

      (2) A second concern is the absence of information on the functional consequences of deleting the gene or the MTE on tooth primordium development. From the description of the KI strategy, it is unclear whether the GFP insertion results in a functional fusion protein. The cytoplasmic GFP distribution and the schematic in Figure S1 instead suggest the presence of a terminal stop codon in the GFP sequence, which would result in a dlx2b loss-of-function allele. If this interpretation is correct, the manuscript does not describe the developmental consequences in homozygous embryos. Similar concerns apply to the MTE deletion: it remains unclear whether loss of this enhancer results in any detectable morphological or developmental defects.

    1. eLife Assessment

      This fundamental study presents experimental evidence on how geomagnetic and visual cues are integrated in a nocturnally migrating insect. The evidence supporting the conclusions is compelling. The work will be of broad interest to researchers studying animal migration and navigation.

    2. [Editors' note: this version has been assessed by the Reviewing Editor without further input from the original reviewers. The authors have addressed the comments raised in the previous round of review.]

      Reviewer #1 (Public review):

      Summary:

      The manuscript by Ma et al. provides robust and novel evidence that the noctuid moth Spodoptera frugiperda (Fall Armyworm) possesses a complex compass mechanism for seasonal migration that integrates visual horizon cues with Earth's magnetic field (likely its horizontal component). This is an important and timely study: apart from the Bogong moth, no other nocturnal Lepidoptera has yet been shown to rely on such a dual-compass system. The research therefore expands our understanding of magnetic orientation in insects with both theoretical (evolution and sensory biology) and applied (agricultural pest management, a new model of magnetoreception) significance.

      The study uses state-of-the-art methods and presents convincing behavioural evidence for a multimodal compass. It also establishes the Fall Armyworm as a tractable new insect model for exploring the sensory mechanisms of magnetoreception, given the experimental challenges of working with migratory birds. Overall, the experiments are well designed, the analyses are appropriate, and the conclusions are generally well supported by the data.

      Strengths:

      • Novelty and significance: First strong demonstration of a magnetic-visual compass in a globally relevant migratory moth species, extending previous findings from the Bogong moth and opening new research avenues in comparative magnetoreception.
      • Methodological robustness: Use of validated and sophisticated behavioural paradigms and magnetic manipulations consistent with best practices in the field. The use of 5 min bins to study the dynamic nature of the magnetic compass, which is anchored to a visual cue but updated with a latency of several minutes, is an important finding and a new methodological aspect in insect orientation studies.
      • Clarity of experimental logic: The cue-conflict and visual cue manipulations are conceptually sound and capable of addressing clear mechanistic questions.
      • Ecological and applied relevance: Results have implications for understanding migration in an invasive agricultural pest with expanding global range.
      • Potential model system: Provides a new, experimentally accessible species for dissecting the sensory and neural bases of magnetic orientation.

      Weaknesses:

      Overall, this is a strong study, and the authors have completed an excellent major revision.

    3. Reviewer #2 (Public review):

      Summary:

      The work titled "Geomagnetic and visual cues guide seasonal migratory orientation in the nocturnal fall armyworm, the world's most invasive insect" provides experimental evidence on how geomagnetic and visual cues are integrated, and shows that visual cues are indispensable for magnetic orientation in the nocturnal fall armyworm.

      Strengths:

      It has been demonstrated that the Australian Bogong moth can integrate global stellar cues with the geomagnetic field for long-distance navigation. However, data are lacking for other insects. This study suggests that the integration of geomagnetic and visual cues may represent a conserved navigational mechanism broadly employed across migratory insects.

      Weaknesses:

      The visual cues used in the indoor experimental system designed by the authors may have some limitations in ecological relevance. The authors may need to explain this experimental system in more detail.

      In the revised manuscript, the authors have added explanations in the discussion section. I am fine with the revision.

    4. Author response:

      The following is the authors’ response to the previous reviews

      Public Reviews:

      Reviewer #1 (Public review):

      (1) Structure and Presentation of Results

      • I recommend reordering the visual-cue experiments to progress from simpler conditions (no cues) to more complex ones (cue-conflict). This would improve narrative logic and accessibility for non-specialist readers. The authors have chosen not to implement this suggestion, which I respect, but my recommendation stands.

      Thank you for this suggestion. We understand your point that presenting the experiments from simpler to more complex conditions may seem more intuitive. However, we have kept the original order because it better reflects the logic of the study itself. Our work first asked whether fall armyworms, like the Bogong moth, use a magnetic compass that is integrated with visual cues. Only after establishing this behavioral feature did we go on to test whether visual cues are required to maintain magnetic orientation. To make this reasoning clearer to readers, we have explicitly stated in the Introduction that magnetic orientation in the Bogong moth depends on the integration of visual cues, which provides clearer context for the experimental design.

      (2) Ecological Interpretation

      • The authors should expand their discussion on how the highly simplified, static cue setup translates to natural migratory conditions, where landmarks are dynamic, transient, or absent. Specifically, further consideration is needed on how the compass might function when landmarks shift position, become obscured, or are replaced by celestial cues. Additionally, the discussion would benefit from a more consolidated section with concrete suggestions for future experiments involving transient, multiple, or more naturalistic visual cues. This point was addressed partially in one paragraph of the Discussion, which reads as follows:

      "In nature, they are likely to encounter a range of luminance-gradient visual cues, including relatively stable celestial cues as well as transient or shifting local features encountered en route. Although such natural cues differ from our simplified laboratory stimulus, they may represent intermittently sampled visual inputs that can be optimally integrated with magnetic information, with the congruency between visual and magnetic cues likely playing a key role in maintaining a stable compass response. Whether the cues are static or changing, brief periods without them may still allow the subsequent recovery of a stable long-distance orientation strategy. Determining which types of natural visual cues support the magnetic-visual compass, and how they interact with magnetic information, including how their momentary alignment or angular relationship is integrated and how such visual cue-magnetic field interactions may require time to influence orientation, together with elucidating the genetic and ecological bases of multimodal orientation, will be important objectives for future research." While this paragraph is informative, the wording remains lengthy, somewhat unclear, and vague. Shorter, clearer statements would improve readability and impact. For example:

      • How could moths maintain direction during periods when only the magnetic field is present and visual landmarks are absent?

      • Could celestial cues (e.g., stars) compensate, and what happens if these are also obscured?

      • What role does saliency play when multiple visual landmarks are present simultaneously?

      • How might a complex skyline without salient landmarks affect orientation?

      Including simple, concise sentences that pose concrete open questions and suggest experimental designs would strengthen the discussion without creating space issues. In my view, a comprehensive discussion of how the simplified, static cue setup relates to natural migratory conditions (where landmarks are dynamic, transient, or absent) would add significant value to the paper.

      Thank you for this constructive and insightful comment. You correctly point out that our articulation of the ecological relevance of the simplified, static cue setup was not sufficiently clear. We also agree that the original wording in the Discussion remained overly general. In the revised Discussion, we updated the manuscript to incorporate recently published findings on the use of light–dark gradients for orientation in fall armyworms. However, we explicitly note that it remains unclear whether fall armyworms can exploit naturally occurring luminance gradients, such as those generated by the moon, for orientation under natural conditions. We further emphasize that during natural migration the visual environment is dynamic, with celestial cues available intermittently and local visual features changing continuously during flight. In this context, we outline several key unresolved questions, including whether celestial cues can compensate when local landmarks are absent; how multiple visual cues are weighted and integrated with geomagnetic information; how transient visual cues (like moving clouds or changing illumination) influence orientation; and how luminance gradients that are common in natural nocturnal environments interact with the geomagnetic field to support orientation. For each of these issues, we briefly suggest experimental approaches to guide future research.

      (3) Methodological Details and Reproducibility

      • The lack of luminance level measurements should be explicitly highlighted.

      Thank you for your helpful suggestion. You are right that luminance level is an important experimental parameter. We have stated this information in the Methods section under Behavioral apparatus: “The ambient light level in the experimental environment was measured to be below 1 lux using a Testo 540 lux meter (Testo SE & Co. KGaA, Titisee-Neustadt, Germany). Further work is still required to compare the illuminance used in this study with that under natural conditions, which are inherently variable.” This point is also clarified in the legend of Figure S3 in the supplementary material.

      • The authors chose not to adjust figure legends by replacing "magnetic South" with "magnetic North." While I believe this would be more conventional and preferable, this is ultimately a minor stylistic issue.

      Thank you very much for your suggestion. We understand your point and agree that using “magnetic North” would be more conventional. However, because our experiments focus on the orientation behavior of the autumn population, magnetic South is aligned with the landmark direction representing the potential migratory direction, which we believe makes the figures more intuitive for readers. We therefore consider this a minor stylistic issue.

      (4) Conceptual Framing and Discussion

      • Although the authors made a good attempt to explain the limitations of using an artificial visual cue, I believe there is room for a more explicit argument. For example, it could be stated clearly that this species is unlikely to encounter a situation in nature where a single, highly salient landmark coincides with its migratory direction. Therefore, how these findings translate to real migratory contexts remains an open question. A sentence or two making this point directly would strengthen the discussion.

      Thank you for your helpful suggestion. We now address this point explicitly in the Discussion, noting that fall armyworms are unlikely to experience a natural visual environment dominated by a single, static, and highly salient landmark coinciding with their migratory direction. Consequently, how these findings translate to real migratory contexts remains an open question.

      (5) Technical and Open-Science Points

      • Sharing the R code openly (e.g., via GitHub) should be seriously considered. The code does not need to be perfectly formatted, but making it available would be highly beneficial from an open-science perspective.

      Thank you for the suggestion. We agree that making code openly available is valuable from an open-science perspective. The MMRT script used in this study is Moore’s Modified Rayleigh Test, available from the original publication by Massy et al. (2021; https://doi.org/10.1098/rspb.2021.1805). In the previous version, we only cited this reference in the Materials and Methods section; we have now added a direct link to the script to improve clarity and accessibility. We have also provided a public link to the data-recording scripts used in the Flash Flight Simulator (https://doi.org/10.17632/6jkvpybswd.1). This repository additionally includes a map-based optical flow script that was not used in the present study but is shared for completeness.
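
      For readers unfamiliar with the test named above, the following is a minimal Python sketch of the textbook form of Moore's modified Rayleigh statistic, in which each heading is weighted by the rank of its vector length. It is not the R script from Massy et al. (2021); the function name and inputs are hypothetical, ties in vector length are assumed absent, and significance would still have to be judged against published critical values or a simulation.

```python
import math


def moore_rayleigh_statistic(angles_deg, lengths):
    """Return Moore's modified Rayleigh statistic R* (illustrative sketch).

    angles_deg: mean heading of each individual, in degrees.
    lengths:    vector length of each individual (assumed to have no ties).
    """
    n = len(angles_deg)
    if n == 0 or n != len(lengths):
        raise ValueError("angles_deg and lengths must be non-empty and equal in size")

    # Rank individuals by vector length: shortest gets rank 1, longest rank n.
    order = sorted(range(n), key=lambda i: lengths[i])
    ranks = [0] * n
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank

    # Rank-weighted mean components of the headings.
    x = sum(t * math.cos(math.radians(a)) for t, a in zip(ranks, angles_deg)) / n
    y = sum(t * math.sin(math.radians(a)) for t, a in zip(ranks, angles_deg)) / n

    # R* = R / sqrt(n), where R is the length of the rank-weighted mean vector.
    return math.sqrt(x * x + y * y) / math.sqrt(n)
```

      Weighting by rank rather than by raw vector length means that well-directed individuals count for more, without any single very long vector dominating the result.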

      Reviewer #1 (Recommendations for the authors):

      • LL. 133-137 (end of paragraph starting with "The fall armyworm is a migratory crop pest native to the Americas"): Suggest splitting into shorter, clearer sentences. The limitations of this method could be better articulated here and elaborated in the Discussion.

      Thank you for this suggestion. We have revised this paragraph by splitting it into shorter, clearer sentences and by articulating the limitations of this method more explicitly. These limitations are further elaborated in the Discussion.

      • LL. 181-185 (end of paragraph starting with "To examine if fall armyworms integrate geomagnetic and visual cues for seasonal migratory orientation"): It would be helpful to state explicitly that season-specific headings have been confirmed in the lab using a flight simulator, but destination regions remain unknown without further tracking experiments.

      Thank you for this helpful suggestion. We have now clarified in the revised manuscript that season-specific orientation headings have been confirmed in the laboratory using a flight simulator, while the actual migratory destination regions remain unclear in the absence of tracking experiments.

      • LL. 230-234 (start of paragraph "Our previous research showed that fall armyworms reared under artificially simulated fall conditions…"): Clarify which migratory season is being referenced.

      Thank you for this helpful suggestion. We have clarified in the text that the migratory season referenced here is the autumn migratory season. In addition, we have added information in the Methods to specify the actual calendar season during which the insects were reared under the simulated conditions.

      • LL. 270-272 (middle of Fig. 2 caption): Suggest explicitly mentioning that for this population, the seasonally appropriate direction is southbound in autumn and northbound in spring, as this may not be clear to non-specialists.

      Thank you for this helpful suggestion. We have now explicitly stated the seasonally appropriate migratory directions for this population, indicating southbound migration in autumn and northbound migration in spring, to improve clarity for non-specialist readers.

      • LL. 421 (middle of paragraph starting with "We also considered the limitations of the Rayleigh test…"): Add that the groups lacking visual cues exhibited "lower directedness as per lower vector length (r)" in addition to lower flight stability.

      Thank you for this helpful suggestion. We further note that the conclusions drawn from the flight stability analysis are consistent with those based on individual r-value analyses.
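
      As a reminder of what the vector length r measures, here is a minimal Python sketch of the standard mean resultant length from circular statistics. The function name and the example headings in the usage note are hypothetical and are not taken from the manuscript.

```python
import math


def mean_vector_length(headings_deg):
    """Return the mean resultant length r (0 = no directedness, 1 = perfectly directed)."""
    n = len(headings_deg)
    if n == 0:
        raise ValueError("at least one heading is required")

    # Sum the unit vectors that point along each heading.
    c = sum(math.cos(math.radians(h)) for h in headings_deg)
    s = sum(math.sin(math.radians(h)) for h in headings_deg)

    # The length of the summed vector, divided by n, is r.
    return math.sqrt(c * c + s * s) / n
```

      Tightly clustered headings such as `[170, 180, 190]` give an r close to 1, whereas headings spread evenly around the circle give an r close to 0.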

      • LL. 499-501 ("unlike some vertebrates that can rely solely on magnetic information (Mouritsen, 2018)"): This point is slightly downplayed. It should be emphasized that nearly all tested vertebrates and invertebrates (e.g., birds, mole rats, fish, frogs, and other insects) demonstrate a magnetic compass without requiring visual landmarks. Moths are the only invertebrates tested so far whose magnetic compass depends on the relationship between a landmark and the magnetic field in order to be manifested as a behavioural orientation response in a flight simulator.

      Thank you for this important comment. We agree that this point represents a key synthesis in the Discussion, as it concerns how our findings relate to, and differ from, magnetic orientation demonstrated in other animal groups. We have therefore expanded the Discussion to note that studies have shown that some animals can exhibit directional preferences in simplified visual environments solely in response to changes in the magnetic field, and we now cite representative examples from birds and mole rats. At the same time, we also acknowledge important methodological and phenotypic differences among taxa. In particular, moths’ magnetic orientation has been assessed using a flight simulator, a setup in which stable directional behavior must be actively maintained during continuous movement. This is an important difference from orientation assays in birds during take-off or in terrestrial mammals such as mole rats. Moreover, whether birds and other animals rely on visual input to detect or calibrate magnetic information under certain conditions remains an open question. We therefore emphasize here both the phenotypic differences observed across experimental systems and the methodological considerations.

      • LL. 560-565 (paragraph starting with "Our flight simulator system (Dreyer et al., 2021) …"): Suggest clarifying what the Flash flight simulator system is and how it differs from the Mouritsen-Frost flight simulator.

      Thank you for this suggestion. We have added a brief clarification of the Flash flight simulator and how it differs from the Mouritsen–Frost system.

      • LL. 605-608 ("Spectral measurements …"): Explicitly mention that total illuminance was not measured and that further work is required to compare the illuminance used with natural conditions, which of course vary.

      Thank you for this helpful suggestion. We agree that total illuminance is an important factor. We have now added a statement noting that the ambient light level in the experimental environment was measured to be below 1 lux using a Testo 540 lux meter, and we further acknowledge that additional work is required to compare the illuminance used in this study with that under naturally variable conditions.

      • LL. 628-641 (end of paragraph starting with "Electromagnetic noise at the experimental site ... "): Explain why this matters for interpreting behavioural responses. Highlight that, although conditions were somewhat magnetically noisy, which past work suggests may disrupt the magnetic compass, as shown in birds (e.g., Engels et al. 2014, Nature), the observed magnetic response under certain conditions indicates that the magnetic sense remained functional when the landmark and magnetic field were aligned. This way you can pre-empt the criticism that your magnetic conditions were not ideal and that noise was measured on the left-hand side of the spectrum (which is not uncommon).

      Thank you for this helpful suggestion. We have now cited Engels et al. (2014, Nature) in this section and expanded the text to explain why electromagnetic noise at the experimental site is relevant for interpreting the behavioural responses. We also clarify the rationale for measuring electromagnetic noise and discuss the observed low-frequency (“left-hand side”) noise in the spectrum.

      • Fig. 51: Suggest adapting Y-axes and using violin or box plots (e.g., panels A/B starting from 30 up to 50, etc.).

      Thank you for this helpful suggestion. We have revised Fig. 5 accordingly by adapting the Y-axis scaling and replacing the original plots with box plots, as suggested.

    1. eLife Assessment

      This valuable study advances our understanding of how organisms respond to chronic oxidative stress. Using the nematode C. elegans, the authors identified key neuronal signaling molecules and their receptors that are required for stress signaling and survival. The evidence supporting the conclusions is solid, including rigorous genetics, stress response analysis, and transcriptional profiling. This research will be of broad interest to neuroscientists and researchers working in the field of oxidative stress regulation.

    2. Reviewer #1 (Public review):

      Summary:

      The researchers aimed to identify which neurotransmitter pathways are required for animals to withstand chronic oxidative stress. This work thus has important implications for disease processes that are caused/linked to oxidative stress. This work identified specific neurotransmitters and receptors that coordinate stress resilience, both prior to and during stress exposure. Further, the authors identified specific transcriptional programs coordinated by neurotransmission that may provide stress resistance.

      Strengths:

      The manuscript is very clearly written with a well-formulated rationale. Standard C. elegans genetic analysis and rescue experiments were performed to identify key regulators of the chronic oxidative stress response. These findings were enhanced by transcriptional profiling that identified differentially expressed genes that likely affect survival when animals are exposed to stress.

      Weaknesses:

      Where the gar-3 promoter drives expression was not discussed in the context of the rescue experiments in Fig 7.

      Comments on revisions:

      This issue has now been appropriately addressed in the revision.

    3. Reviewer #2 (Public review):

      In this paper, Biswas et al. describe the role of acetylcholine (ACh) signaling in protection against chronic oxidative stress in C. elegans. They showed that disruption of ACh signaling in either unc-17 mutants or gar-3 mutants led to sensitivity to toxicity caused by chronic paraquat (PQ) treatment. Using RNA seq, they found that approximately 70% of the genes induced by chronic PQ exposure in wild type failed to upregulate in these mutants. The overexpression of gar-3 selectively in cholinergic neurons was sufficient to promote protection against chronic PQ exposure in an ACh-dependent manner. The study points to a previously undescribed role for ACh signaling in providing organism-wide protection from chronic oxidative stress, likely through the transcriptional regulation of numerous oxidative stress-response genes. The paper is well-written, and the data are robust, though some conclusions seem preliminary and are not fully supported by the current data (see below). While the study identifies the muscarinic ACh receptor gar-3 as an important regulator of the response to PQ, the specific neurons in which gar-3 functions were not unambiguously identified, and the sources of ACh that regulate GAR-3 signaling and the identities of the tissues targeted by gar-3 were not addressed.

      Comments on revisions:

      The authors addressed my comments adequately in their revised submission. Please include representative images to accompany the quantification of the new results presented in Fig S4A.

    4. Author response:

      The following is the authors’ response to the original reviews

      Reviewer #1 (Public review):

      The researchers aimed to identify which neurotransmitter pathways are required for animals to withstand chronic oxidative stress. This work thus has important implications for disease processes that are caused/linked to oxidative stress. This work identified specific neurotransmitters and receptors that coordinate stress resilience, both prior to and during stress exposure. Further, the authors identified specific transcriptional programs coordinated by neurotransmission that may provide stress resistance.

      Strengths:

      The manuscript is very clearly written with a well-formulated rationale. Standard C. elegans genetic analysis and rescue experiments were performed to identify key regulators of the chronic oxidative stress response. These findings were enhanced by transcriptional profiling that identified differentially expressed genes that likely affect survival when animals are exposed to stress.

      We thank the reviewer for their positive assessment.

      Weaknesses:

      Where the gar-3 promoter drives expression was not discussed in the context of the rescue experiments in Figure 7.

      We now provide information about expression driven by the 7.5 kb gar-3 promoter fragment and compare it directly with our analysis of endogenous gar-3 expression using the genome-modified gar-3::SL2::GFP strain (Page 16, new Figures 8 and S3).

      Reviewer #1 (Recommendations for the authors):

      (1) Figure 3B is not mentioned in the text.

      Fixed. Figure 3B is now called out on page 10 of the revised manuscript.

      (2) The rationale for using the specific PQ concentration was not provided.

      We selected this concentration based on its use for chronic assays by other studies in the field to allow for direct comparison with our results. We now clarify this point in the Methods section (Page 26 of the revised text).

      (3) Transgenic animals injected with the unc-17βp::gar-3 transgene (25 ng/μL) displayed strikingly increased survival in the presence of 4 mM PQ compared to either gar-3 mutants or wild type (should have a Figure cited here)

      Fixed. Figure 9E is now referenced on Page 19 of the revised text.

      (4) The text describing Figure 7C details a comparison with the gar-3 single mutant but the graph shows the unc-17 single mutant

      Figure 7C is a comparison of the survival of gar-3 single mutants with either wild type or gar-3;ric-3 double mutants as described in the text.

      Reviewer #2 (public comments)

      In this paper, Biswas et al. describe the role of acetylcholine (ACh) signaling in protection against chronic oxidative stress in C. elegans. They showed that disruption of ACh signaling in either unc-17 mutants or gar-3 mutants led to sensitivity to toxicity caused by chronic paraquat (PQ) treatment. Using RNA seq, they found that approximately 70% of the genes induced by chronic PQ exposure in wild type failed to upregulate in these mutants. The overexpression of gar-3 selectively in cholinergic neurons was sufficient to promote protection against chronic PQ exposure in an ACh-dependent manner. The study points to a previously undescribed role for ACh signaling in providing organism-wide protection from chronic oxidative stress, likely through the transcriptional regulation of numerous oxidative stress-response genes. The paper is well-written, and the data are robust, though some conclusions seem preliminary and are not fully supported by the current data. While the study identifies the muscarinic ACh receptor gar-3 as an important regulator of the response to PQ, the specific neurons in which gar-3 functions were not unambiguously identified, and the sources of ACh that regulate GAR-3 signaling and the identities of the tissues targeted by gar-3 were not addressed, limiting the scope of the study.

      We thank the reviewer for their positive assessment. We provide additional data and discussion of the points raised by the reviewer in the revised manuscript. In particular, as suggested by the reviewer, we conducted additional tissue-specific rescue experiments to better define the GAR-3 site of action. We found that specific rescue of gar-3 expression in either cholinergic motor neurons or muscles each provides partial rescue. In addition, we quantified the expression of the nhr-185 and fbxa-73 genes, identified as upregulated by PQ in our RNA-seq studies, following oxidative stress (new Fig. S4). We observed increased expression of both genes following PQ exposure, providing independent confirmation of transcriptional upregulation of these genes as part of the stress response. See the responses to points #1 and #3 below for additional details.

      Major Comments:

      (1) The site of action of cholinergic signaling for protection from PQ was not adequately explored. The authors' conclusion that cholinergic motor neurons are protective is based on studies using overexpression of gar-3 and an unc-17 allele that may selectively disrupt ACh in cholinergic motor neurons (Figure 9F), but these approaches are indirect. To more directly address the site of action, the authors should conduct rescue experiments using well-defined heterologous promoters. Figure 7G shows that gar-3 expressed under a 7.5 kb promoter fragment fully rescues the defect of gar-3 mutants, but the authors did not report where this promoter fragment is expressed, nor did they conduct rescue experiments of the specific tissues where gar-3 is known to be expressed (cholinergic neurons, GABAergic neurons, pharynx, or muscles). UNC-17 rescue experiments could also be useful to address the site of action. Does expression of unc-17 selectively in cholinergic motor neurons rescue the stress sensitivity of unc-17 mutants (or restore resistance to gar-3(OE); unc-17 mutants)? These experiments may also address whether ACh acts in an autocrine or paracrine manner to activate gar-3, which would be an important mechanistic insight to this study that is currently lacking.

      We performed additional rescue experiments using heterologous promoters to drive gar-3 expression in cholinergic neurons or muscle and found that each provided a small but significant degree of rescue as assessed from Kaplan-Meier survival curves. These results are presented in Figure 8 of the revised manuscript. We have not conducted similar unc-17 rescue experiments; however, we point out that cell-specific unc-17 knockdown by RNAi using the unc-17b promoter (expression largely restricted to ventral cord ACh motor neurons) increases sensitivity to PQ in our long-term survival assays (Figure 3A). Combined with our analysis of unc-17(e113) mutants, we believe these results support a requirement for unc-17 expression in cholinergic motor neurons.
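
      For readers less familiar with the survival analysis mentioned above, the following is a minimal Python sketch of the standard Kaplan-Meier (product-limit) estimator. It is an illustration only, not the authors' analysis pipeline; the function name and the data layout (one observation time and one death/censoring flag per animal) are assumptions made for the example.

```python
def kaplan_meier(times, events):
    """Return [(time, survival_probability), ...] at each time with a death.

    times:  observation time for each animal (e.g. day of death or censoring).
    events: True if the animal died at that time, False if it was censored.
    """
    if len(times) != len(events):
        raise ValueError("times and events must have the same length")

    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []

    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all animals that share the same observation time.
        while i < len(data) and data[i][0] == t:
            if data[i][1]:
                deaths += 1
            removed += 1
            i += 1
        if deaths > 0:
            # Multiply by the fraction of at-risk animals surviving this time.
            survival *= 1 - deaths / n_at_risk
            curve.append((t, survival))
        # Dead and censored animals both leave the risk set afterwards.
        n_at_risk -= removed

    return curve
```

      Censored animals (for example, those lost from the plate) lower the number at risk without counting as deaths, which is why the estimator is preferred over a simple fraction-alive calculation. Comparing two curves would additionally require a test such as the log-rank test, which is not shown here.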

      (2) The genetic pan-neuronal silencing experiments presented in Figure 1 motivated the subsequent experiments, but the authors did not relate these observations to ACh/gar-3 signaling. For example, the authors did not address whether silencing just the cholinergic motor neurons at the different times tested has the same effects on survival as pan-neuronal silencing.

      We used the pan-neuronal silencing to motivate further analysis of various neurotransmitter systems. Our genetic studies implicate both glutamatergic and cholinergic systems in protective responses to oxidative stress. The effects of pan-neuronal silencing on survival during long-term PQ exposure may therefore be derived solely from cholinergic neurons, glutamatergic neurons, or a combination of both neuronal populations. Distinguishing between these possibilities may be quite complicated and is not central to the main message of our paper. We therefore suggest this additional analysis lies outside the scope of this revision. Nonetheless, to address the reviewer’s point, in the revised text we expand our discussion relating the pan-neuronal silencing results to our analysis of ACh signaling (pages 21-22).

      (3) It is assumed that protection occurs through inter-tissue signaling of ACh to target tissues, where it impacts gene expression. While this is a reasonable assumption, it has not been directly shown here. It is recommended that the authors examine GFP reporter expression of a sampling of the genes identified in this study (including proteasomal genes that the authors highlight) that are regulated by unc-17 and gar-3. This would serve to independently confirm the RNAseq data and to identify target tissues that are subject to gene expression regulation by ACh, which would significantly strengthen the study.

      Agreed. To address this question, we investigated expression of the nhr-185 and fbxa-73 genes implicated as upregulated by oxidative stress in our RNA-seq studies. Consistent with our RNA-seq findings, we observed significantly increased expression of a nhr-185pr::GFP transcriptional reporter, primarily in the pharynx and anterior intestine, following 48 hrs of PQ exposure. These results support transcriptional upregulation of expression in these tissues as part of the stress response. fbxa-73 was among the proteasomal genes implicated as oxidative stress-responsive by RNA-seq. Consistent with this finding, by quantitative RT-PCR we observed a significant increase in fbxa-73 expression in wild type animals following 48 hrs of PQ treatment. These new results provide independent confirmation of the gene expression changes we observed by RNA-seq and are now included in new Figure S4 and discussed on Pages 17-18 of the revised manuscript.

      Reviewer #2 (Recommendations for the authors):

      (1) As an independent way of addressing whether enhanced ACh signaling is sufficient for protection, the authors could examine stress resistance in ace mutants, as was reported in PMID: 39097618, or in mutants with increased ACh secretion.

      We thank the reviewer for this suggestion. We are investigating the impact of increased cholinergic activation in a separate study, and experiments along the lines the reviewer suggests form one facet of that work. Our findings here provide evidence that increasing GAR-3 signaling in ACh motor neurons by cell-specific overexpression enhances protection.

      (2) To address the specificity of ACh signaling by gar-3 for this response, the authors could report survival data for mutants lacking each of the other two mACh receptors, gar-1 and gar-2.

      We thank the reviewer for this suggestion. We now include new data showing that gar-3;gar-2 double mutants have similar survival to gar-3 single mutants in the presence of PQ (new Figure 7F). We agree that further studies of additional GPCRs (e.g. gar-1 and metabotropic glutamate receptors) will be required to definitively establish specificity for GAR-3 and we now acknowledge this point on page 15 of the revised text.

      (3) Do carbonylation levels correlate with toxicity? For example, do gar-3 mutants have more carbonylation and gar-3 OE have less?

      This is an interesting question. To try to address this, we performed additional protein carbonylation experiments for unc-17 and gar-3 mutants. We found a similar increase in protein carbonylation following PQ exposure for gar-3 mutants as observed for wild type; however, we also noted a higher level of batch-to-batch variability for gar-3 compared with wild type and are therefore hesitant to draw firm conclusions. We have not included these data in the revised manuscript but provide them for the reviewer’s information here (Author response image 1 shows our prior N2 data for comparison). We were not able to conduct similar experiments for unc-17 mutants because we noted local starvation when the animals were grown at the high density required to obtain the protein quantities needed for these experiments.

      Author response image 1.

      (4) Citations in text for Figures 4A and 8A are missing.

      Fixed. Figures 4A and 8A (now 9A) are cited on pages 10 and 17 of the revised text, respectively.

      (5) Figures 4-6 and 8 have limited information content. Condense or move to supplementary.

      While we acknowledge the reviewer’s viewpoint here, we believe that the analyses of the transcriptional responses described in Figures 4-6 and 8 are central to the study. To address reviewers’ comments, we have included a new Figure 8 and merged previous Figures 8 and 9 (new Figure 9) in the revised manuscript.

      (6) "expression of" is repeated in "Finally, transgenic expression of expression of a wild-type GAR-3::YFP"

      Fixed.

    1. eLife Assessment

      This important study shows that orientation tuning of V1 neurons is suppressed during a continuous flash suppression paradigm, especially in neurons with binocular receptive fields. These findings, made using cutting-edge imaging techniques, convincingly implicate early visual processing in continuous flash suppression, in agreement with previous studies suggesting reduced effective contrast of such stimuli in V1.

    2. Reviewer #1 (Public review):

      This study makes a fundamental contribution to our understanding of interocular suppression, particularly continuous flash suppression (CFS). Using neuroimaging data from two macaque monkeys, the study provides compelling evidence that CFS suppresses orientation responses in neurons within V1. These findings enrich the CFS literature by demonstrating that neural activity under CFS may prevent high-level visual and cognitive processing.

      Comments on revisions:

      The authors have addressed all my previous comments.

    3. Reviewer #2 (Public review):

      Summary:

      The goal of this study was to investigate the degree to which low-level stimulus features (i.e., grating orientation) are processed in V1 when stimuli are not consciously perceived under conditions of continuous flash suppression (CFS). The authors measured the activity of a population of V1 neurons at single neuron resolution in awake fixating monkeys while they viewed dichoptic stimuli that consisted of an oriented grating presented to one eye and a noise stimulus to the other eye. Under such conditions, the mask stimulus can prevent conscious perception of the grating stimulus. By measuring the activity of neurons (with Ca2+ imaging) that preferred one or the other eye, the authors tested the degree of orientation processing that occurs during CFS.

      Strengths:

      The greatest strength of this study is the spatial resolution of the measurement and the ability to quantify stimulus representations during CFS in populations of neurons preferring the eye stimulated by either the grating or the mask. There have been a number of prominent fMRI studies of CFS, but all of them have had the limitation of pooling responses across neurons preferring either eye, effectively measuring the summed response across ocular dominance columns. The ability to isolate separate populations offers an exciting opportunity to study the precise neural mechanisms that give rise to CFS, and potentially provide insights into nonconscious stimulus processing.

      Weaknesses:

      However, while this is an impressive experimental setup, the major weakness of this study is that the experiments don't advance any theoretical account of why CFS occurs or what CFS implies for conscious visual perception. There are two broad camps of thinking with regard to CFS. On the one hand, Watanabe et al., 2011 reported that V1 activity remained intact during CFS, implying that CFS interrupts stimulus processing downstream of V1. On the other hand, Yuval-Greenberg and Heeger (2013) showed that V1 activity is in fact reduced during CFS. By using a parametric experimental design, they measured the impact of the mask on the stimulus response as a function of contrast, and concluded that the mask reduces the gain of neural responses to the grating stimulus. They presented a theoretical model in which the mask effectively reduced the SNR of the grating, making it invisible in the same way that reducing contrast makes a stimulus invisible.

      In the first submission of the manuscript, the authors incorrectly described the Yuval-Greenberg & Heeger (2013) and Watanabe et al. (2011) papers, suggesting that they had observed the same or similar effects of CFS on V1 activity, when in fact they had described opposite results. Reviewer 1 also observed that the authors appeared to be confused in their reading of these highly relevant papers. In the revision, the authors have reworked this paragraph, now correctly describing these sets of opposing results. However, I still do not understand what the authors are trying to argue: "...these studies were not designed to quantify the pure effect of CFS on stimulus-evoked V1 responses." I do not understand what is meant by "pure" in this case. Regardless, it is clear that the measurements in the present study strongly support the interpretation of Yuval-Greenberg & Heeger (i.e., that V1 activity is degraded by CFS, 'akin' to a loss in the contrast-to-noise ratio of neural activity). It would be appropriate for the authors to communicate this clearly.

      I continue to be of the opinion that this study is lacking an adequate model of interocular interactions that might explain the Ca2+ imaging. The machine learning results are not terribly surprising - multivariate methods, such as SVMs, are more sensitive than univariate approaches. So it is plausible that an SVM can support decoding of the coarse orientation information, even when no tuning is evident in the univariate analyses. However, the link between this result and the underlying neurophysiology is opaque. The failure to model the neural data with an explicit model is a missed opportunity.

    4. Reviewer #3 (Public review):

      Summary:

      In this study, Tang, Yu & colleagues investigate the impact of continuous flash suppression (CFS) on the responses of V1 neurons using 2-photon calcium imaging. They report that CFS substantially suppressed V1 orientation responses. This suppression happens in a graded fashion depending on the binocular preference of the neuron: neurons preferring the eye that was presented with the masker stimuli were most suppressed, while the neurons preferring the eye to which the grating stimuli were presented were least suppressed. Binocular neurons exhibited an intermediate level of suppression.

      Strengths:

      The imaging techniques are cutting-edge.

      Weaknesses:

      The strength of CFS suppression varies across animals, but the authors attribute this to comparable heterogeneity in the human psychophysics literature.

      Comments on revisions:

      The authors have addressed my comments from the previous round of review, and I have no further comments.

    5. Author response:

      The following is the authors’ response to the original reviews

      eLife Assessment

      This important study shows that orientation tuning of V1 neurons is suppressed during a continuous flash suppression paradigm, especially when the neurons have a binocular receptive field. However, the evidence presented is incomplete and, in particular, does not distinguish whether this suppression is due to reduced contrast or due to masking.

      This assessment is primarily based on the critique of Reviewer 2 that our results do not distinguish whether the impact of CFS is due to reduced contrast or due to masking. Reviewer 2 referred to Yuval-Greenberg and Heeger (2013), noting that: “V1 activity is, in fact, reduced during CFS … the mask reduces the gain of neural responses to the grating stimulus … making it invisible in the same way that reducing contrast makes a stimulus invisible.” To be precise, Yuval-Greenberg and Heeger (2013) used “akin to”, instead of “the same way”, in their abstract.

      We agree that CFS masking and contrast reduction can both lower the signal-to-noise ratio and thereby reduce visibility. However, these two factors operate in fundamentally different ways. According to gain control models by Heeger and others, reducing the physical contrast of a stimulus decreases the excitatory drive, while dichoptic masking increases the normalization pool. Our findings therefore reflect genuine masking-induced suppression and are not attributable to stimulus contrast reduction.
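
      The distinction drawn above can be illustrated with a minimal sketch of a divisive-normalization response, assuming the commonly used form drive / (sigma^n + drive + w * mask^n). The function name, the exponent, the semi-saturation constant, the mask weight, and the contrast values are all illustrative assumptions; they are not taken from the manuscript, and this is not the authors' model.

```python
from typing import NamedTuple


class Response(NamedTuple):
    drive: float  # excitatory drive from the target grating
    pool: float   # normalization pool (denominator)
    rate: float   # normalized response, arbitrary units


def normalized_response(target_contrast, mask_contrast=0.0,
                        exponent=2.0, sigma=0.1, mask_weight=1.0):
    """Hypothetical divisive-normalization response to a target grating.

    All parameter values are assumptions chosen for illustration only.
    """
    drive = target_contrast ** exponent
    # The dichoptic mask contributes only to the pool, not to the drive.
    pool = sigma ** exponent + drive + mask_weight * mask_contrast ** exponent
    return Response(drive, pool, drive / pool)


baseline = normalized_response(0.5)                     # grating alone
low_contrast = normalized_response(0.25)                # physical contrast halved
masked = normalized_response(0.5, mask_contrast=0.5)    # dichoptic mask added

for name, r in [("baseline", baseline),
                ("low contrast", low_contrast),
                ("masked", masked)]:
    print(f"{name:>12}: drive={r.drive:.4f} pool={r.pool:.4f} rate={r.rate:.3f}")
```

      In this toy setting both manipulations lower the response, but only contrast reduction lowers the drive term, whereas the mask leaves the drive unchanged and enlarges the pool.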

      Public Reviews:

      Reviewer #1 (Public review):

      Disclaimer: While I am familiar with the CFS method and the CFS literature, I am not familiar with primate research or two-photon calcium imaging. Additionally, I may be biased regarding unconscious processing under CFS, as I have extensively investigated this area but have found no compelling evidence in favor of unconscious processing under CFS.

      This manuscript reports the results of a nonhuman-primate study (N=2 behaving macaque monkeys) investigating V1 responses under continuous flash suppression (CFS). The results show that CFS substantially suppressed V1 orientation responses, albeit slightly differently in the two monkeys. The authors conclude that CFS-suppressed orientation information "may not suffice for high-level visual and cognitive processing" (abstract).

      The manuscript is clearly written and well-organized. The conclusions are supported by the data and analyses presented (but see disclaimer). However, I believe that the manuscript would benefit from a more detailed discussion of the different results observed for monkeys A and B (i.e., inter-individual differences), and how exactly the observed results are related to findings of higher-order cognitive processing under CFS, on the one hand, and the "dorsal-ventral CFS hypothesis", on the other hand.

      We thank the reviewer for the helpful comments and suggestions. In the revision, we added new content discussing the inter-individual differences and the "dorsal-ventral CFS hypothesis", and made other changes, which are detailed below.

      Major Comments:

      (1) Some references are imprecise. For example, l.53: "Nevertheless, two fMRI studies reported that V1 activity is either unaffected or only weakly affected (Watanabe et al., 2011; Yuval-Greenberg & Heeger, 2013)". To the best of my understanding, the second study reaches a conclusion that is entirely opposite to that of the first, specifically that for low-contrast, invisible stimuli, stimulus-evoked fMRI BOLD activity in the early visual cortex (V1-V3) is statistically indistinguishable from activity observed during stimulus-absent (mask-only) trials. Therefore, high-level unconscious processing under CFS should not be possible if Yuval-Greenberg & Heeger are correct. The two studies contradict each other; they do not imply the same thing.

      We apologize for not making our point clear. Our original concern was that the effects of CFS on V1 activity were underestimated, even in Yuval-Greenberg & Heeger (2013), as both studies compared monocular and dichoptic masking to estimate the influence of visibility. In contrast, in the original psychophysical studies, the CFS effect was assessed by comparing conditions with and without dichoptic masking, a contrast that is expected to yield a stronger effect. We rewrote the paragraph to clarify.

      “Two prominent fMRI studies have examined the impact of CFS on V1 activity (Watanabe et al., 2011; Yuval-Greenberg & Heeger, 2013). Watanabe et al. (2011) compared monocular CFS masking (stimulus visible) and dichoptic CFS masking (stimulus invisible), and reported that V1 BOLD responses were largely insensitive to stimulus visibility when attention was carefully controlled. However, using similar experimental design, Yuval-Greenberg and Heeger (2013) observed reduced BOLD responses in V1 under dichoptic masking, suggesting that V1 activity changed with stimulus visibility. They attributed the difference of results between two studies mainly to differences in statistical power (~250 trials per condition vs. ~90 trials per condition). Nevertheless, these studies were not designed to quantify the pure effect of CFS on stimulus-evoked V1 responses, as they contrasted monocular and dichoptic masking conditions to equate stimulus input while manipulating perceptual visibility. In contrast, original psychophysical studies (Tsuchiya & Koch, 2005; Tsuchiya, Koch, Gilroy, & Blake, 2006) demonstrated CFS masking by contrasting the visibility of the target stimulus with and without the presence of dichoptic mask. It is apparent that the pure CFS impact in above fMRI studies would be the difference of BOLD signals between binocular masking and stimulus alone conditions. In other words, the impact of CFS on V1 activity should be larger than what has been reported by Yuval-Greenberg and Heeger (2013).” (lines 55-71)

      (2) Line 354: "The flashing masker was a circular white noise pattern with a diameter of 1.89°, a contrast of 0.5, and a flickering rate of 10 Hz. The white noise consisted of randomly generated black and white blocks (0.07 × 0.07 each)." Why did the authors choose a white noise stimulus as the CFS mask? It has previously been shown that the depth of suppression engendered by CFS depends jointly on the spatiotemporal composition of the CFS and the stimulus it is competing with (Yang & Blake, 2012). For example, Hesselmann et al. (2016) compared Mondrian versus random dot masks using the probe detection technique (see Supplementary Figure S4 in the reference below) and found only a poor masking performance of the random dot masks.

      Yang, E., & Blake, R. (2012). Deconstructing continuous flash suppression. Journal of Vision, 12(3), 8. https://doi.org/10.1167/12.3.8

      Hesselmann, G., Darcy, N., Ludwig, K., & Sterzer, P. (2016). Priming in a shape task but not in a category task under continuous flash suppression. Journal of Vision, 16, 1-17.

      In a previous human psychophysical study, we used the same noise pattern, and the CFS effect appeared to be robust (Xiong et al., 2016, https://doi.org/10.7554/eLife.14614). However, we believe that the reviewer made a good point, and our choice of masker pattern may have contributed to the weaker suppression in Monkey B. This issue is now discussed in the revision regarding the individual variability in our results.

      “In addition, the random-noise masker we used might not be as effective as Mondrian patterns (G. Hesselmann, Darcy, Ludwig, & Sterzer, 2016). If reduced stimulus contrast and a Mondrian masker were used, we predict that CFS suppression in Monkey B would strengthen, potentially approaching the level observed in Monkey A. Nevertheless, it is worth emphasizing that our main conclusions are primarily based on data from Monkey A, who exhibited much stronger CFS suppression.” (lines 321-327)

      (3) Related to my previous point: I guess we do not know whether the monkeys saw the CF-suppressed grating stimuli or not? Therefore, could it be that the differences between monkey A and B are due to a different individual visibility of the suppressed stimuli? Interocular suppression has been shown to be extremely variable between participants (see reference below). This inter-individual variability may, in fact, be one of the reasons why the CFS literature is so heterogeneous in terms of unconscious cognitive processing: due to the variability in interocular suppression, a significant amount of data is often excluded prior to analysis, leading to statistical inconsistencies.

      Yamashiro, H., Yamamoto, H., Mano, H., Umeda, M., Higuchi, T., & Saiki, J. (2014). Activity in early visual areas predicts interindividual differences in binocular rivalry dynamics. Journal of Neurophysiology, 111(6), 1190-1202. https://doi.org/10.1152/jn.00509.2013

      The individual difference issue is now explicitly addressed in the Discussion:

      “Interocular suppression under CFS is known to vary substantially across individuals (Blake, Goodman, Tomarken, & Kim, 2019; Gayet & Stein, 2017; Yamashiro et al., 2013). This inter-individual variability may contribute to the heterogeneity observed in the CFS literature. We also found that the strength of V1 response suppression during CFS differed between two monkeys, as reflected by population orientation tuning functions (Fig. 2C), Fisher information (Fig. 2F), and reconstruction performance by the transformer (Fig. 3E). Several experimental factors may have contributed to the relatively weaker suppression observed in Monkey B. Because monkeys viewed the stimuli passively, we could not determine the dominant eye for each monkey (instead we switched the eyes and averaged the results), and the target was presented at relatively high contrast. Both factors are known to reduce the effectiveness of CFS suppression (Yang, Blake, & McDonald, 2010; Yuval-Greenberg & Heeger, 2013). In addition, the random-noise masker we used might not be as effective as Mondrian patterns (G. Hesselmann, Darcy, Ludwig, & Sterzer, 2016). If reduced stimulus contrast and a Mondrian masker were used, we predict that CFS suppression in Monkey B would strengthen, potentially approaching the level observed in Monkey A. Nevertheless, it is worth emphasizing that our main conclusions are primarily based on data from Monkey A, who exhibited much stronger CFS suppression.” (lines 311-327)

      Moreover, the authors' main conclusion (lines 305-307) builds on the assumption that the stimuli were rendered invisible, but isn't this speculation without a measure of awareness?

      We agree. To correct this, we have removed the original lines 305-307 discussing conscious perception and reframed the manuscript throughout to focus on the impact of CFS on neural coding rather than on perceptual awareness. For example, the title has been changed to:

      “Continuous flashing suppression of neural responses and population orientation coding in macaque V1”,

      and the ending line of Introduction was changed to:

      “This approach enabled us to investigate the potentially differential impacts of CFS on the responses of V1 neurons with varying ocular preferences, as well as apply machine learning tools to understand the impacts of CFS on V1 stimulus coding at the population level.” (lines 81-83)

      (4) The authors refer to the "tool priming" CFS studies by Almeida et al. (l.33, l.280, and elsewhere) and Sakuraba et al. (l.284). A thorough critique of this line of research can be found here:

      Hesselmann, G., Darcy, N., Rothkirch, M., & Sterzer, P. (2018). Investigating Masked Priming Along the "Vision-for-Perception" and "Vision-for-Action" Dimensions of Unconscious Processing. Journal of Experimental Psychology. General. https://doi.org/10.1037/xge0000420

      This line of research ("dorsal-ventral CFS hypothesis") has inspired a significant body of behavioral and fMRI/EEG studies (see reference for a review below). The manuscript would benefit from a brief paragraph in the discussion section that addresses how the observed results contribute to this area of research.

      Ludwig, K., & Hesselmann, G. (2015). Weighing the evidence for a dorsal processing bias under continuous flash suppression. Consciousness and Cognition, 35, 251-259. https://doi.org/10.1016/j.concog.2014.12.010

      In the revision, we added a new paragraph discussing issues related to the dorsal-ventral CFS hypothesis.

      “A related issue is the dorsal-ventral CFS hypothesis, which proposes that CFS suppression may disproportionately affect ventral visual processing while relatively preserving dorsal pathways involved in visuomotor functions, potentially allowing category- or action-related information to remain accessible under suppression (Fang & He, 2005). However, subsequent fMRI studies have failed to provide consistent support for this dissociation, reporting either stream-invariant awareness effects (Guido Hesselmann & Malach, 2011; Ludwig et al., 2015; Tettamanti et al., 2017), residual signal in ventral rather than dorsal regions (Fogelson et al., 2014; Guido Hesselmann et al., 2011), or residual low-level feature information/partial visibility rather than preserved dorsal processing (Ludwig et al., 2015). Although our study does not directly test dorsal-ventral dissociations, our V1 results provide a constraint on what information downstream visual pathways could access under suppression. When CFS-induced interocular suppression was strong enough and stimuli reconstruction was markedly reduced, as in the case of Monkey A, the information required for category-level or action-related processing may not be sufficient for high-level cortical representation.” (lines 297-310)

      Reviewer #2 (Public review):

      Summary:

      The goal of this study was to investigate the degree to which low-level stimulus features (i.e., grating orientation) are processed in V1 when stimuli are not consciously perceived under conditions of continuous flash suppression (CFS). The authors measured the activity of a population of V1 neurons at single neuron resolution in awake fixating monkeys while they viewed dichoptic stimuli that consisted of an oriented grating presented to one eye and a noise stimulus to the other eye. Under such conditions, the mask stimulus can prevent conscious perception of the grating stimulus. By measuring the activity of neurons (with Ca2+ imaging) that preferred one or the other eye, the authors tested the degree of orientation processing that occurs during CFS.

      Strengths:

      The greatest strength of this study is the spatial resolution of the measurement and the ability to quantify stimulus representations during CFS in populations of neurons preferring the eye stimulated by either the grating or the mask. There have been a number of prominent fMRI studies of CFS, but all of them have had the limitation of pooling responses across neurons preferring either eye, effectively measuring the summed response across ocular dominance columns. The ability to isolate separate populations offers an exciting opportunity to study the precise neural mechanisms that give rise to CFS, and potentially provide insights into nonconscious stimulus processing.

      Weaknesses:

      While this is an impressive experimental setup, the major weakness of this study is that the experiments don't advance any theoretical account of why CFS occurs or what CFS implies for conscious visual perception. There are two broad camps of thinking with regard to CFS. On the one hand, Watanabe et al. (2011) reported that V1 activity remained intact during CFS, implying that CFS interrupts stimulus processing downstream of V1. On the other hand, Yuval-Greenberg and Heeger (2013) showed that V1 activity is, in fact, reduced during CFS. By using a parametric experimental design, they measured the impact of the mask on the stimulus response as a function of contrast and concluded that the mask reduces the gain of neural responses to the grating stimulus. They presented a theoretical model in which the mask effectively reduced the SNR of the grating, making it invisible in the same way that reducing contrast makes a stimulus invisible.

      We used a multi-class SVM (as suggested by reviewer 3) and a transformer-based model to examine the impact of CFS on the classification of 12 orientations spaced in 15° gaps, which resembles coarse orientation discrimination, as well as on stimulus reconstruction, which resembles stimulus perception necessary for high-level cognitive tasks, respectively. The results suggest that under CFS, an observer may still be able to perform coarse orientation discrimination but not high-level cognitive tasks. These findings provide new insights into the implications of CFS for conscious visual perception from a population decoding perspective.

      In the revision, we also added a new paragraph discussing the implications of our findings for the dorsal-ventral CFS hypothesis, as suggested by reviewer 1. We previously presented a gain control model for our neuronal data in a VSS talk. However, we later decided that, since there are already nice models by Heeger and others, it would be better to present something more unique and novel (i.e., machine learning results), which has now become a major component of the manuscript. We welcome the reviewer’s comments on this part.

      An important discussion point of Yuval-Greenberg and Heeger is that null results (such as those presented by Watanabe et al.) are difficult to interpret, as the lack of an effect may be simply due to insufficient data. I am afraid that this critique also applies to the present study.

      We are very much puzzled by the reviewer’s critique. First, our main result is not a null effect. A null effect would mean that CFS masking had no impact on population orientation responses. Instead, we observed significantly suppressed or abolished tuning, which clearly indicates a strong effect of dichoptic masking. Second, our findings are based on large neural populations recorded using two-photon imaging, providing extensive sampling and statistical power. Thus, we believe that the reviewer’s critique about “insufficient data” is not applicable to our study.

      Here, the authors report that CFS effectively 'abolishes' tuning for stimuli in neurons preferring the eye with the grating stimulus. The authors would have been in a much stronger position to make this claim if they had varied the contrast of the stimulus to show that the loss of tuning was not simply due to masking.

      We are sorry that we cannot follow the logic here either. Even if “the mask effectively reduced the SNR of the grating, making it invisible in the same way that reducing contrast makes a stimulus invisible” (or, more precisely, “akin to”, per the abstract of Yuval-Greenberg and Heeger, 2013), it does not necessarily mean that dichoptic masking and contrast reduction are the same process or are based on the same neuronal mechanisms. According to gain control models by Heeger and others, reducing the stimulus contrast decreases the excitatory drive, while dichoptic masking increases the normalization pool via interocular suppression. Both lower SNR, but they are two fundamentally distinct processes.

      Therefore, varying the stimulus contrast might reveal a main effect of contrast, and possibly an interaction between contrast and dichoptic masking, but it would neither prove nor disprove the main effect of dichoptic masking.

      So, while this is an incredibly impressive set of measurements that in many ways raises the bar for in vivo Ca2+ imaging in behaving macaques, there isn't anything in the results that constitutes a real theoretical advance.

      We sincerely hope that the reviewer would have a better judgment after reading our responses.

      Reviewer #3 (Public review):

      Summary:

      In this study, Tang, Yu & colleagues investigate the impact of continuous flash suppression (CFS) on the responses of V1 neurons using 2-photon calcium imaging. They report that CFS substantially suppressed V1 orientation responses. This suppression happens in a graded fashion depending on the binocular preference of the neuron: neurons preferring the eye that was presented with the masker stimuli were most suppressed, while the neurons preferring the eye to which the grating stimuli were presented were least suppressed. Binocular neurons exhibited an intermediate level of suppression.

      Strengths:

      The imaging techniques are cutting-edge, and the imaging results are convincing and consistent across animals.

      Weaknesses:

      I am not totally convinced by the conclusions that the authors draw based on their machine learning models.

      Thank you for pointing out this issue. We have reanalyzed the data using the multi-class SVM suggested by the reviewer and found similar results, as detailed below.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      (1) Lines 56-63: "As a result, the dichoptic CFS masking, which is cortical, could be substantially stronger than monocular masking when accounting for the pre-cortical effects of monocular masking." I don't quite understand this argument. Could you please elaborate?

      We have revised our writing to address the reviewer’s first major comment, to which the current issue is related. The elaboration is highlighted in the paragraph below.

      “Two prominent fMRI studies have examined the impact of CFS on V1 activity (Watanabe et al., 2011; Yuval-Greenberg & Heeger, 2013). Watanabe et al. (2011) compared monocular CFS masking (stimulus visible) and dichoptic CFS masking (stimulus invisible), and reported that V1 BOLD responses were largely insensitive to stimulus visibility when attention was carefully controlled. However, using similar experimental design, Yuval-Greenberg and Heeger (2013) observed reduced BOLD responses in V1 under dichoptic masking, suggesting that V1 activity changed with stimulus visibility. They attributed the difference of results between two studies mainly to differences in statistical power (~250 trials per condition vs. ~90 trials per condition). Nevertheless, these studies were not designed to quantify the pure effect of CFS on stimulus-evoked V1 responses, as they contrasted monocular and dichoptic masking conditions to equate stimulus input while manipulating perceptual visibility. In contrast, original psychophysical studies (Tsuchiya & Koch, 2005; Tsuchiya, Koch, Gilroy, & Blake, 2006) demonstrated CFS masking by contrasting the visibility of the target stimulus with and without the presence of dichoptic mask. It is apparent that the pure CFS impact in above fMRI studies would be the difference of BOLD signals between binocular masking and stimulus alone conditions. In other words, the impact of CFS on V1 activity should be larger than what has been reported by Yuval-Greenberg and Heeger (2013).” (lines 55-71)

      (2) Line 13 low-level stimulus (properties).

      Fixed, thanks.

      Reviewer #3 (Recommendations for the authors):

      Major comments:

      (1) My main comment is regarding the SVM classifiers. The pair-wise (adjacent orientation pairs) decoding approach is unrealistic in my opinion and likely explains the very high accuracies that are reported. I believe that a multi-way classification approach - Linear Discriminant Analysis, Decision Trees, etc. - is needed to draw reasonable conclusions. Even SVMs can be adapted for multi-way classification (e.g., Allwein et al., 2000, J. Machine Learning Research).

      Following the reviewer’s advice, we reanalyzed the data using a multi-class SVM with a one-vs-one (OvO) scheme to classify 12 orientations (Allwein et al., 2000), which yielded similar results.

      “For orientation classification, we trained an all-pair multiclass support vector machine (SVM) classifier to discriminate 12 orientations based on trial-by-trial population neural responses from all trials (Allwein, Schapire, & Singer, 2000). Decoders for different FOVs, ipsilateral/contralateral target presentations, and baseline vs. CFS conditions were trained separately. Under the baseline condition, the decoders achieved mean classification accuracies of 89.5 ± 2.0% and 91.5 ± 2.1% across ipsilateral and contralateral eye conditions in Monkeys A and B, respectively, in contrast to a chance level of 8.3% (1 out of 12). Under CFS, decoding accuracy slightly decreased in Monkey A (81.7 ± 1.9%) but remained stable in Monkey B (90.4 ± 2.1%, Fig. 3A). These results suggest that under CFS, there is still sufficient information for coarse orientation discrimination, even for Monkey A whose V1 neuronal responses were substantially suppressed.” (lines 171-181)
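
      The one-vs-one (all-pairs) voting scheme and the 8.3% chance level mentioned above can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' pipeline: to stay dependency-free, the pairwise learners here are simple nearest-centroid rules standing in for SVMs, and the population size, trial counts, and noise level are assumed values.

```python
import itertools
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two response vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def train_ovo(X, y):
    """Prepare one binary learner per pair of classes.

    Stand-in learner (assumption): each pairwise decision compares the
    distances to the two class centroids instead of fitting an SVM.
    """
    classes = sorted(set(y))
    centroids = {}
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        centroids[c] = [sum(col) / len(rows) for col in zip(*rows)]
    pairs = list(itertools.combinations(classes, 2))
    return centroids, pairs


def predict_ovo(model, x):
    """Each pairwise learner casts one vote; the most-voted class wins."""
    centroids, pairs = model
    votes = {c: 0 for c in centroids}
    for a, b in pairs:
        winner = a if sq_dist(x, centroids[a]) <= sq_dist(x, centroids[b]) else b
        votes[winner] += 1
    return max(votes, key=votes.get)


# Synthetic "population responses": 12 orientations in 15-degree steps.
rng = random.Random(0)
orientations = list(range(0, 180, 15))
n_neurons, n_train, n_test = 20, 30, 10          # illustrative sizes
means = {o: [rng.gauss(0, 1) for _ in range(n_neurons)] for o in orientations}


def sample(o):
    """One noisy trial for orientation o (noise level is an assumption)."""
    return [m + rng.gauss(0, 0.5) for m in means[o]]


X_train = [sample(o) for o in orientations for _ in range(n_train)]
y_train = [o for o in orientations for _ in range(n_train)]
X_test = [sample(o) for o in orientations for _ in range(n_test)]
y_test = [o for o in orientations for _ in range(n_test)]

model = train_ovo(X_train, y_train)
n_pairs = len(model[1])
accuracy = sum(predict_ovo(model, x) == t for x, t in zip(X_test, y_test)) / len(y_test)
chance = 1 / len(orientations)

print(f"pairwise learners: {n_pairs}")
print(f"chance level: {100 * chance:.1f}%")
print(f"held-out accuracy on synthetic data: {100 * accuracy:.1f}%")
```

      With 12 classes the scheme needs 12 × 11 / 2 = 66 pairwise learners, and guessing uniformly gives 1/12 ≈ 8.3%, which is the reference point for the accuracies quoted above.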

      (2) The inconsistent modeling results (Figure 3E,F) are puzzling and need to be adequately addressed.

      SSIM and orientation error in the original Fig. 3E, F both measured reconstruction quality, but the two indices pointed in opposite directions for the same modeling results. To avoid confusion, we have removed the orientation error metric and now only report SSIM.

      “We used a structural similarity index (SSIM) (Brunet, Vrscay, & Wang, 2012) to quantify the reconstruction performances. Across the grating-presenting ipsilateral and contralateral eyes, the baseline models reconstructed the grating with median SSIMs of 0.52 and 0.61 for the two FOVs of Monkey A, and 0.57 and 0.63 for the two FOVs of Monkey B, respectively, while the corresponding SSIMs for the CFS models were 0.16 and 0.19 for Monkey A, and 0.55 and 0.53 for Monkey B (Fig. 3E).” (lines 200-206)
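      For readers unfamiliar with the index, the sketch below computes a simplified SSIM between a target grating and a reconstruction. It is an assumption-laden illustration rather than the authors' implementation: it evaluates the SSIM formula once over the whole image (standard implementations average it over local windows), and the `grating` helper and the constants `k1 = 0.01`, `k2 = 0.03` are conventional defaults chosen here, not values taken from the manuscript.

```python
import math


def global_ssim(x, y, data_range=1.0, k1=0.01, k2=0.03):
    """Single-window SSIM between two equally sized images given as flat lists."""
    if len(x) != len(y) or not x:
        raise ValueError("images must be non-empty and of equal size")
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    c1 = (k1 * data_range) ** 2  # stabilizes the luminance term
    c2 = (k2 * data_range) ** 2  # stabilizes the contrast/structure term
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


def grating(theta_deg, size=32, cycles=4.0):
    """Sinusoidal grating with values in [0, 1], flattened row by row."""
    theta = math.radians(theta_deg)
    pixels = []
    for r in range(size):
        for c in range(size):
            phase = 2 * math.pi * cycles * (c * math.cos(theta) + r * math.sin(theta)) / size
            pixels.append(0.5 + 0.5 * math.sin(phase))
    return pixels


target = grating(0)
same = global_ssim(target, target)             # perfect reconstruction
orthogonal = global_ssim(target, grating(90))  # wrong orientation
print(f"identical: {same:.3f}, orthogonal: {orthogonal:.3f}")
```

      An identical reconstruction scores 1, while a grating of the wrong orientation scores near 0 because its covariance with the target vanishes. This is why a drop from roughly 0.5-0.6 to below 0.2, as reported for Monkey A under CFS, indicates a substantial loss of reconstructed structure.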

      Minor points:

      (1) The phrase "perceptual consequences" in the title is somewhat strong and possibly misleading, since there are no behavioral measures in this study.

      To address this concern from this reviewer and reviewer 1, we now focus on the impact of CFS on population orientation coding rather than perceptual consequences, which more appropriately describes our modeling results. For example, we changed the title to: “Continuous flashing suppression of neural responses and population orientation coding in macaque V1”. Other changes have also been made throughout the manuscript accordingly.

      (2) Figure 4: Panel "F" is not marked in the figure.

      Fixed, thanks.

    1. eLife Assessment

      The authors show that innate defensive behavior in mice is shaped by threat intensity, reward value, and social hierarchy, highlighting how value and social context influence instinctive decisions. The authors provide a valuable characterization of escape behavior which approximates naturalistic conditions. The evidence is incomplete due to indirect measures of vigilance and somewhat misleading characterizations of the looming stimulus.

    2. Reviewer #1 (Public review):

      This study by Li and colleagues examines how defensive responses to visual threats during foraging are modulated by both reward level and social hierarchy. Using a naturalistic paradigm, the authors test how the availability of water or sucrose, with sucrose being more rewarding than water, shapes escape behavior in mice exposed to looming stimuli of different intensities, which are used to probe perceived threat level and defensive responses. In parallel, the study compares dominant and subordinate animals to assess how social rank biases the trade off between reward seeking and threat avoidance. By combining detailed behavioral analyses with computational modeling, the work addresses how reward level and social context jointly influence escape decisions in an ethologically relevant setting.

      Across the different experimental conditions, perceived threat level is the main determinant of behavior. The authors show that looming stimuli associated with higher threat (contrast) consistently elicit faster and more robust escape responses than lower threat stimuli. This effect is particularly evident during early exposures, when animals are highly vigilant and have not yet habituated to the looming stimulus (i.e., have not yet learned that it is not dangerous). The authors then describe how, as animals gain experience and habituate, behavior becomes more flexible, and reward level begins to exert a graded modulation of the escape response. Importantly, the authors show that under high threat conditions increasing reward value leads to more frequent and faster escape rather than greater reward pursuit. This finding is particularly relevant, as it suggests that highly valued rewards can heighten vigilance and thereby enhance responsiveness to threat, highlighting that reward does not simply compete with defensive behavior but can also reshape it depending on the perceived level of danger, in contrast to low threat conditions, where threat can be more easily outweighed by reward. Thus, an important conceptual contribution of the study is the introduction of vigilance as a useful framework to interpret these effects. Vigilance is treated as a behavioral state reflecting heightened attention to potential danger. In line with what is known from natural foraging, mice initially maintain high vigilance when confronted with an innate threat. This perspective helps clarify a finding that might otherwise appear counterintuitive. One might expect higher rewards to motivate animals to tolerate risk, explore more, and habituate faster in any scenario. Instead, the data suggest that highly rewarding outcomes can elevate vigilance, making animals more responsive to threat and leading to faster or more frequent escape under high threat conditions. In this sense, reward does not simply compete with threat but can also amplify sensitivity to it, depending on the internal state of the animal.

      The social results are particularly interesting in this context as well. Dominant mice consistently prioritize avoidance over reward, showing stronger escape responses and slower habituation than subordinates. This behavior is well captured by the vigilance framework proposed by the authors: dominant animals appear to maintain higher vigilance, which biases decisions toward threat avoidance. The authors further suggest that stable social relationships sustain high vigilance and slow habituation, framing this as an evolutionarily conserved strategy that may enhance survival. This interpretation provides a valuable perspective on how social structure shapes defensive behavior beyond immediate physical interactions. At the same time, there are important limitations to this interpretation. All experiments were conducted in male mice, and it is possible that the relationship between social hierarchy, vigilance, and defensive behavior would differ substantially in females. In addition, the idea that stable social relationships maintain elevated vigilance does not straightforwardly align with broader views of social stability as protective for mental health and as a buffer against anxiety and stress. These points do not undermine the findings but suggest that the social effects described here should be interpreted with caution and within the specific context of the task and sex studied.

      Another important limitation is that the neural mechanisms underlying these effects remain speculative. The manuscript includes an extensive discussion of candidate circuits, particularly involving the superior colliculus and downstream structures, but this section is necessarily based on prior literature rather than on data presented in the study. Given the complexity of the circuits involved in integrating internal state, reward, social context, and vigilance, the current work should be viewed as providing a strong behavioral and conceptual framework rather than direct insight into underlying neural mechanisms.

      Methodologically, the behavioral paradigm is well suited for studying escape decisions in socially housed animals, and the machine learning based classification of defensive responses is a clear strength. The computational model provides a useful formalization of how threat level, reward level, and vigilance interact and may be valuable for other laboratories studying escape, approach avoidance, or conflict situations, particularly as a way to classify behavioral outcomes after pose estimation. More generally, the work will be of interest to the neuroethology community for its detailed characterization of escape behavior under naturalistic conditions.

      Given the ethological nature of the study and the high inter individual variability reported by the authors, clarity and precision in the methods are especially important for reproducibility. While the revised manuscript addresses many earlier concerns, some aspects remain slightly difficult to follow. For example, the main text states that animals were not water deprived to avoid differences in internal state, whereas parts of the methods describe conditions in which animals were water deprived, suggesting that internal state manipulation may differ across experiments. Clearer separation and explanation of these conditions would further strengthen confidence in the work.

      Overall, this study provides a rich and thoughtful analysis of how reward level and social hierarchy modulate defensive behavior through changes in vigilance. It offers a useful conceptual advance for thinking about escape behavior in naturalistic settings and lays a solid foundation for future work aimed at linking these behavioral states to underlying neural circuits.

    3. Reviewer #2 (Public review):

      Zhe Li and colleagues investigate how mice exposed to visual threats and rewards balance their decisions in favour of consuming rewards or engaging in defensive actions. By varying threat intensity and reward value, they first confirm previous findings showing that defensive responses increase with threat intensity and that there is habituation to the threat stimulus. They then find that water-deprived mice have a reduced probability of escaping from low contrast visual looming stimuli when water or sucrose are offered in the environment, but that when the stimulus contrast is high, the presence of sucrose or water increases the probability of escape. By analysing behaviour metrics such as the latency to flee from the threat stimulus, they suggest that this increase in threat sensitivity is due to increased vigilance. Analysis of this behaviour as a function of social hierarchy shows that dominant mice have higher threat sensitivity, which is also interpreted as being due to increased vigilance. These results are captured by a drift diffusion model variant that incorporates threat intensity and reward value.
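      To make the modeling idea concrete, the sketch below simulates a generic single-bound drift diffusion process with a response deadline. It is only an illustration under stated assumptions: the drift is a single free number, and how threat intensity and reward value enter the drift in the authors' variant is not specified in this text, so the function names, the bound, the noise level, and the 10 s deadline are all hypothetical choices.

```python
import math
import random


def simulate_escape(drift, bound=1.0, noise_sd=1.0, dt=0.01, max_time=10.0, rng=None):
    """Return the escape latency in seconds, or None if the bound is never reached."""
    rng = rng or random.Random()
    x = 0.0  # accumulated evidence for escaping
    step_sd = noise_sd * math.sqrt(dt)
    n_steps = int(round(max_time / dt))
    for step in range(1, n_steps + 1):
        x += drift * dt + rng.gauss(0.0, step_sd)
        if x >= bound:
            return step * dt
    return None  # no escape before the deadline


def summarize(drift, n_trials=500, seed=0):
    """Escape probability and mean latency of escape trials for one drift value."""
    rng = random.Random(seed)
    latencies = [simulate_escape(drift, rng=rng) for _ in range(n_trials)]
    escapes = [lat for lat in latencies if lat is not None]
    p_escape = len(escapes) / n_trials
    mean_latency = sum(escapes) / len(escapes) if escapes else float("nan")
    return p_escape, mean_latency


weak = summarize(drift=0.1)    # hypothetical low-drift condition
strong = summarize(drift=1.0)  # hypothetical high-drift condition
print(f"weak drift:   P(escape) = {weak[0]:.2f}, mean latency = {weak[1]:.2f} s")
print(f"strong drift: P(escape) = {strong[0]:.2f}, mean latency = {strong[1]:.2f} s")
```

      In this kind of model a larger drift produces both more frequent and faster escapes, so escape probability and latency are coupled through a single parameter. That coupling is part of why latency alone cannot isolate a separate construct such as vigilance.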

      The main contribution of this work is quantifying how the presence of water or sucrose in water-deprived mice affects escape behaviour. The differential effects of reward between the low and high contrast conditions are intriguing, but I find the interpretation that vigilance plays a major role in this process not supported by the data. The idea that reward value exerts some form of graded modulation of the escape response is also not supported by the data. In addition, there is very limited methodological information, which makes assessing the quality of some of the analyses difficult, and there is no quantification of the quality of the model fits.

      (1) The main measure of vigilance in this work is reaction time. While reaction time can indeed be affected by vigilance, reaction times can vary as a function of many variables, and be different for the same level of vigilance. For example, a primate performing the random dot motion task exhibits differences in reaction times that can be explained entirely by the stimulus strength. Reaction time is therefore not a sound measure of vigilance, and if a goal of this work is to investigate this parameter, then it should be measured. There is some attempt at doing this for a subset of the data in Figure 3H, by looking at differences in the action of monitoring the visual field (presumably a rearing motion, though this is not described) between the first and second trials in the presence of sucrose. I find this an extremely contrived measure. What is the rationale for analysing only the difference between the first and second trials? Also, the results are only statistically significant because the first trial in the sucrose condition happens to have zero up action bouts, in contrast to all other conditions. I am afraid that the statistics are not solid here. When analysing the effects of dominance, a vigilance metric is the time spent in the reward zone. Why is this a measure of vigilance? More generally, measuring vigilance of threats in mice requires monitoring the position of the eyes, which previous work has shown is biased to the upper visual field, consistent with the threat ecology of rodents.

      (2) In both low and high contrast conditions, there are differences in escape behaviour between no reward and water or sucrose presence, but no statistically significant differences between water and sucrose (eg: Figure 3B). I therefore find that statements about reward value are not supported by the data, which only show differences between the presence or absence of reward. Furthermore, there is a confound in these experiments, because according to the methods, mice in the no-reward condition were not water-deprived. It is thus possible that the differences in behaviour arise from differences in the underlying state.

      (3) There is very little methodological information on behavioural quantification. For example, what is hiding latency? Is this the same as reaction time? Time to reach the safe zone? What exactly is distance fled? I don't understand how this can vary between 20 and 100 cm. Presumably, the 20 cm flights don't reach the safe place, since the threat is roughly at the same location for each trial? How is the end of a flight determined? How is duration measured in reward zone measures, e.g., from when to when? How is fleeing onset determined?

      (4) There is little methodological information on how the model was fit. For example, it is surprising that in the no-reward condition the r parameter is exactly 0; was this constrained in any way? In addition, none of the fit parameters have uncertainty measures, so it is not possible to assess whether any differences in parameters are statistically significant.
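      One common way to attach uncertainty to an estimate, sketched below, is a percentile bootstrap. This is a generic illustration and not something the manuscript is stated to have done: the latencies are synthetic, the statistic is a plain mean standing in for a fitted model parameter, and `bootstrap_ci` is a hypothetical helper. For a real model, the statistic would be the full refit of the parameter on each resampled data set.

```python
import random
import statistics


def bootstrap_ci(samples, statistic, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for `statistic` of `samples`."""
    rng = random.Random(seed)
    n = len(samples)
    estimates = sorted(
        statistic([samples[rng.randrange(n)] for _ in range(n)])  # resample with replacement
        for _ in range(n_boot)
    )
    lower = estimates[int(round((alpha / 2) * n_boot))]
    upper = estimates[int(round((1 - alpha / 2) * n_boot)) - 1]
    return lower, upper


# Synthetic escape latencies in seconds (placeholder data, not from the study).
data_rng = random.Random(1)
latencies = [abs(data_rng.gauss(1.5, 0.5)) for _ in range(30)]

point_estimate = statistics.mean(latencies)
ci_low, ci_high = bootstrap_ci(latencies, statistics.mean)
print(f"mean = {point_estimate:.2f} s, 95% CI = [{ci_low:.2f}, {ci_high:.2f}] s")
```

      Two conditions whose intervals overlap heavily would not support a claim that the corresponding parameters differ, which is the kind of check the comment above asks for.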

      Comments on the revised manuscript:

      The manuscript has been revised and improved significantly by the addition of methodological details and new analysis. I remain, however, unconvinced by the argument that increased vigilance in the presence of reward leads to heightened escape behaviour.

      In response to my criticism that the work does not measure vigilance directly, the authors have included measures of foraging interval and foraging speed, which they state are "two direct behavioral analyses of vigilance". I disagree - like reaction time, foraging speed and foraging interval can be modulated, for example, by changes in threat sensitivity. Increased threat sensitivity comes with diverse behavioral changes that may well include increased vigilance, but foraging interval and foraging speed can certainly change without the animal expressing increased vigilance behaviors. A bigger issue I still have, though, is with the conclusion that the presence of reward increases "direct escape behaviors". Comparing the no reward, water and sucrose groups indeed shows a difference (which is now clear after the split into early and late phases), but the issue is that these are different mice. As the text is written, it sounds like introducing reward will acutely increase escape. But if we look at the raw data shown in Figure 2C, what I think is happening is that the presence of reward is decreasing habituation to the stimulus. The data for trials 1 and 10 in the three conditions show this - there is habituation with no reward (reaction times are all shifting to the right), a bit less with water and very little with sucrose. This is interesting in its own right and we can speculate why it might be happening, but I think this is conceptually different from what the authors are proposing.

    4. Reviewer #3 (Public review):

      Male mice were tested in a classic behavioral "flee the looming stimulus" paradigm. This is a purely behavioral study; no neural analyses were done. Mice were housed socially, but faced the looming stimulus individually, using an elegant automated tunnel (see videos for clarity).

      The additional changes made to the paper clarify the work done. While there are some limitations (male mice, weird stimulus), the general results are interesting and a valuable addition to the experimental literature. The main claim of the paper is that the different rewards (none, water, sucrose) did not change the escape properties early in learning but did so late in learning. In particular, in the late (already experienced) conditions, reward value (assuming sucrose > water > no reward) interacted with the salience of the looming stimulus (light gray, dark gray) (Panels 3D, 3G, 3K, 3N).

      For readers, I want to note that one of the most interesting results is actually in Figure S2, where they find that a looming stimulus behind the mouse still makes a mouse run to the nest. In these conditions, the mouse runs past the looming stimulus to get to safety! (I also do love the video of the mouse running around the barriers like a snake to get home.)

      I have a few minor clarification questions and a few notes that I think would be useful additions for authors and readers to think about.

      Dominance: What does the mouse social science literature say about the "test tube" test? What can we conclude from this test? This would be useful when trying to understand what is causing the dominance/submissive difference in responses. Figure 4 shows that the dominant mice are more risk-averse than the submissive mice. Is "dominance" in the test-tube actually a measure of risk-seeking? Is the issue that the submissive mice don't think they can get back to the food-site easily, so they are less willing to sacrifice the current (if dangerous) foraging opportunity? Is the issue that the submissive mice can't get back to the nest? As I understand it, the nest was always available to all the mice, so I suspect inability to get to the nest is an unlikely hypothesis. Is the issue that the submissive mice also don't feel safe in the nest?

      Limitations of the study: There is an acknowledged limitation to male mice, and the limitations of the small data sets that are typical of such experiments. In addition, however, it is also worth noting the strangeness of the looming stimulus, which is revealed clearly in the videos. The stimulus is a repeating growing circle, growing in a single location within the environment. The stimulus repeats 10 times, once per second. This is not what an attacking hawk or owl would look like. (I now have this image of an owl diving down, and then teleporting up and diving down again.) Note - I am fine with this stimulus. It produces an interesting experiment and interesting results. I do not think the authors need to change anything in their paper, but readers need to recognize that this is not a "looming predator".

      These "limitations" are better seen as "caveats" when folding these results in with the rest of the literature that has gone before and the literature to come. (Generally, I do not believe that science works by studies making discoveries that change how we think about problems - instead, science works by studies adding to the literature that we integrate in with the rest of the literature.) Thus, these caveats should not be taken as problems with the study or as fixes that need to be done. Instead, they are notes for future researchers to notice if differences are found in any future studies.

      Thus, my only suggestion is that I think authors could write a more careful paper by using the past and subjunctive tense appropriately. Experimental observations should be in past tense, as in "the influence of reward was context-dependent and emerged in the late phase" instead of "the influence of reward is context-dependent and emerges in the late phase" - it emerged in the late phase this once - it might not in future experiments, not due to any fault in this experiment nor due to replicability problems, but rather due to unexpected differences between this and those future experiments. At which point, it will be up to those future experiments to determine the difference. Similarly, large conclusions should be in the subjunctive tense, as in "these data suggest that threat intensity is likely to be the primary determinant of decision making" rather than "threat intensity is the primary determinant of decision making", because those are hypotheses not facts.

    5. Author response:

      The following is the authors’ response to the current reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      This study by Li and colleagues examines how defensive responses to visual threats during foraging are modulated by both reward level and social hierarchy. Using a naturalistic paradigm, the authors test how the availability of water or sucrose, with sucrose being more rewarding than water, shapes escape behavior in mice exposed to looming stimuli of different intensities, which are used to probe perceived threat level and defensive responses. In parallel, the study compares dominant and subordinate animals to assess how social rank biases the trade off between reward seeking and threat avoidance. By combining detailed behavioral analyses with computational modeling, the work addresses how reward level and social context jointly influence escape decisions in an ethologically relevant setting.

      Across the different experimental conditions, perceived threat level is the main determinant of behavior. The authors show that looming stimuli associated with higher threat (contrast) consistently elicit faster and more robust escape responses than lower threat stimuli. This effect is particularly evident during early exposures, when animals are highly vigilant and have not yet habituated to the looming stimulus (i.e., have not yet learned that it is not dangerous). The authors then describe how, as animals gain experience and habituate, behavior becomes more flexible, and reward level begins to exert a graded modulation of the escape response. Importantly, the authors show that under high threat conditions increasing reward value leads to more frequent and faster escape rather than greater reward pursuit. This finding is particularly relevant, as it suggests that highly valued rewards can heighten vigilance and thereby enhance responsiveness to threat, highlighting that reward does not simply compete with defensive behavior but can also reshape it depending on the perceived level of danger, in contrast to low threat conditions, where threat can be more easily outweighed by reward. Thus, an important conceptual contribution of the study is the introduction of vigilance as a useful framework to interpret these effects. Vigilance is treated as a behavioral state reflecting heightened attention to potential danger. In line with what is known from natural foraging, mice initially maintain high vigilance when confronted with an innate threat. This perspective helps clarify a finding that might otherwise appear counterintuitive. One might expect higher rewards to motivate animals to tolerate risk, explore more, and habituate faster in any scenario. Instead, the data suggest that highly rewarding outcomes can elevate vigilance, making animals more responsive to threat and leading to faster or more frequent escape under high threat conditions. In this sense, reward does not simply compete with threat but can also amplify sensitivity to it, depending on the internal state of the animal.

      The social results are particularly interesting in this context as well. Dominant mice consistently prioritize avoidance over reward, showing stronger escape responses and slower habituation than subordinates. This behavior is well captured by the vigilance framework proposed by the authors: dominant animals appear to maintain higher vigilance, which biases decisions toward threat avoidance. The authors further suggest that stable social relationships sustain high vigilance and slow habituation, framing this as an evolutionarily conserved strategy that may enhance survival. This interpretation provides a valuable perspective on how social structure shapes defensive behavior beyond immediate physical interactions. At the same time, there are important limitations to this interpretation. All experiments were conducted in male mice, and it is possible that the relationship between social hierarchy, vigilance, and defensive behavior would differ substantially in females. In addition, the idea that stable social relationships maintain elevated vigilance does not straightforwardly align with broader views of social stability as protective for mental health and as a buffer against anxiety and stress. These points do not undermine the findings but suggest that the social effects described here should be interpreted with caution and within the specific context of the task and sex studied.

      We thank the reviewer for raising this important point. In the context of repeated looming exposure, slower habituation reflects more sustained vigilance over time. Compared to individually housed mice, group-housed mice exhibit slower habituation (Lenz et al., 2022), and pair-housed mice showed even slower habituation in our current work. Importantly, this pattern does not indicate that pair-housed mice have higher overall vigilance than individually housed animals. Although individually housed mice habituate more quickly, they display higher initial vigilance, as reflected by their increased probability of escaping in response to looming stimuli (Lenz et al., 2022). Thus, pair-housed mice exhibited reduced defensive responses compared to individually housed animals, consistent with a social buffering effect.

      Furthermore, in a separate study (Rank- and Threat-Dependent Social Modulation of Innate Defensive Behaviors; Li, Gao, Li, 2026, eLife 15:RP109571), we directly compared responses to looming stimuli when mice were tested alone versus in the presence of a social partner and observed clear evidence of social buffering.

      Another important limitation is that the neural mechanisms underlying these effects remain speculative. The manuscript includes an extensive discussion of candidate circuits, particularly involving the superior colliculus and downstream structures, but this section is necessarily based on prior literature rather than on data presented in the study. Given the complexity of the circuits involved in integrating internal state, reward, social context, and vigilance, the current work should be viewed as providing a strong behavioral and conceptual framework rather than direct insight into underlying neural mechanisms.

      We fully agree that the proposed neural mechanisms remain speculative and that the circuits involved in integrating internal state, reward, and social context are likely far more complex. We have revised the manuscript to acknowledge this limitation.

      Methodologically, the behavioral paradigm is well suited for studying escape decisions in socially housed animals, and the machine learning based classification of defensive responses is a clear strength. The computational model provides a useful formalization of how threat level, reward level, and vigilance interact and may be valuable for other laboratories studying escape, approach avoidance, or conflict situations, particularly as a way to classify behavioral outcomes after pose estimation. More generally, the work will be of interest to the neuroethology community for its detailed characterization of escape behavior under naturalistic conditions.

      Given the ethological nature of the study and the high inter individual variability reported by the authors, clarity and precision in the methods are especially important for reproducibility. While the revised manuscript addresses many earlier concerns, some aspects remain slightly difficult to follow. For example, the main text states that animals were not water deprived to avoid differences in internal state, whereas parts of the methods describe conditions in which animals were water deprived, suggesting that internal state manipulation may differ across experiments. Clearer separation and explanation of these conditions would further strengthen confidence in the work.

      To improve clarity, we have revised the Methods section to clearly distinguish between experimental conditions that involved water deprivation and those that did not.

      Overall, this study provides a rich and thoughtful analysis of how reward level and social hierarchy modulate defensive behavior through changes in vigilance. It offers a useful conceptual advance for thinking about escape behavior in naturalistic settings and lays a solid foundation for future work aimed at linking these behavioral states to underlying neural circuits.

      Reviewer #2 (Public review):

      Zhe Li and colleagues investigate how mice exposed to visual threats and rewards balance their decisions in favour of consuming rewards or engaging in defensive actions. By varying threat intensity and reward value, they first confirm previous findings showing that defensive responses increase with threat intensity and that there is habituation to the threat stimulus. They then find that water-deprived mice have a reduced probability of escaping from low contrast visual looming stimuli when water or sucrose are offered in the environment, but that when the stimulus contrast is high, the presence of sucrose or water increases the probability of escape. By analysing behaviour metrics such as the latency to flee from the threat stimulus, they suggest that this increase in threat sensitivity is due to increased vigilance. Analysis of this behaviour as a function of social hierarchy shows that dominant mice have higher threat sensitivity, which is also interpreted as being due to increased vigilance. These results are captured by a drift diffusion model variant that incorporates threat intensity and reward value.

      The main contribution of this work is quantifying how the presence of water or sucrose in water-deprived mice affects escape behaviour. The differential effects of reward between the low and high contrast conditions are intriguing, but I find the interpretation that vigilance plays a major role in this process not supported by the data. The idea that reward value exerts some form of graded modulation of the escape response is also not supported by the data. In addition, there is very limited methodological information, which makes assessing the quality of some of the analyses difficult, and there is no quantification of the quality of the model fits.

      (1) The main measure of vigilance in this work is reaction time. While reaction time can indeed be affected by vigilance, reaction times can vary as a function of many variables, and be different for the same level of vigilance. For example, a primate performing the random dot motion task exhibits differences in reaction times that can be explained entirely by the stimulus strength. Reaction time is therefore not a sound measure of vigilance, and if a goal of this work is to investigate this parameter, then it should be measured. There is some attempt at doing this for a subset of the data in Figure 3H, by looking at differences in the action of monitoring the visual field (presumably a rearing motion, though this is not described) between the first and second trials in the presence of sucrose. I find this an extremely contrived measure. What is the rationale for analysing only the difference between the first and second trials? Also, the results are only statistically significant because the first trial in the sucrose condition happens to have zero up action bouts, in contrast to all other conditions. I am afraid that the statistics are not solid here. When analysing the effects of dominance, a vigilance metric is the time spent in the reward zone. Why is this a measure of vigilance? More generally, measuring vigilance of threats in mice requires monitoring the position of the eyes, which previous work has shown is biased to the upper visual field, consistent with the threat ecology of rodents.

      We agree that reaction time can be influenced by multiple factors, including stimulus strength. Consistent with this, reaction times (i.e. latencies to flee) were substantially shorter under high-contrast conditions (Figure 3E). However, even under the same high-contrast condition, reaction times were significantly shorter in the water condition compared to the no-reward condition, suggesting that other factors such as vigilance may contribute.

      Upward-directed attention includes rearing, up-stretching, and upward head orientation, which will be clarified in the Methods section. To address concerns about statistical validity, we will quantify these behaviors across the first 10 trials rather than limiting the analysis to the first two.

      As for the dominance-related results, we interpret them as reflecting both enhanced vigilance and reduced reward-seeking behavior. Time spent in the reward zone is not a measure of vigilance but an indicator of reward-seeking motivation. We will clarify this in the revised manuscript.

      (2) In both low and high contrast conditions, there are differences in escape behaviour between no reward and water or sucrose presence, but no statistically significant differences between water and sucrose (e.g., Figure 3B). I therefore find that statements about reward value are not supported by the data, which only show differences between the presence or absence of reward. Furthermore, there is a confound in these experiments, because according to the methods, mice in the no-reward condition were not water-deprived. It is thus possible that the differences in behaviour arise from differences in the underlying state.

      In Figure 3B, the difference between water and sucrose conditions did not reach statistical significance (p = 0.08). We plan to collect additional data to determine whether this is due to limited statistical power. It is also possible that some behavioral readouts are more sensitive to the differences between water and sucrose conditions. For example, Figure 3F shows that escape speed was significantly higher in the sucrose than in the water condition under high-contrast stimulation.

      Thank you for pointing this out. To control for the potential confounds related to internal state, mice were not water-deprived under any of the three conditions in Figures 3A-3H. We will clarify this in the main text and Methods. For Figures 3I-3M, which compare decision-making under no-reward and water conditions, we will conduct additional experiments using non-deprived mice in the water condition.

      (3) There is very little methodological information on behavioural quantification. For example, what is hiding latency? Is this the same as reaction time? Time to reach the safe zone? What exactly is distance fled? I don't understand how this can vary between 20 and 100cm. Presumably, the 20cm flights don't reach the safe place, since the threat is roughly at the same location for each trial? How is the end of a flight determined? How is duration measured in reward zone measures, e.g., from when to when? How is fleeing onset determined?

      Hiding latency was defined as the time from stimulus onset to the animal’s arrival at the safe zone. Reaction time was quantified as the latency to flee, measured from stimulus onset to the initiation of the first flight state. The flight state was defined as locomotion exceeding 10 cm at a speed greater than 10 cm/s. Distance fled was defined as the distance covered between stimulus onset and offset for all trials. However, in trials classified as no reaction or freezing, this measure does not accurately reflect escape behavior. We will therefore rename it as distance under threat to better capture its meaning. The reward zone was defined as the region within 10 cm of the reward port at the end of the arena. Duration in the reward zone was measured as the time spent within this region during the 20 seconds following stimulus onset. In Figure 4E, the percentage of time spent in the reward zone was calculated relative to the total time the mouse remained in the arena during the 2-hour social session.

      All definitions and additional details on behavioral quantification will be included in the revised Methods section.

      (4) There is little methodological information on how the model was fit (for example, it is surprising that in the no reward condition, the r parameter is exactly 0. Was this constrained in any way?), and none of the fit parameters have uncertainty measures, so it is not possible to assess whether there are actually any differences in parameters that are statistically significant.

      We appreciate the comment and agree that further clarification is needed. We will provide a more detailed description of the model fitting procedure in the revised Methods section. Specifically, the drift rate parameter (r), which reflects the perceived reward value, was constrained to zero in the no-reward condition. To enable statistical comparison across conditions, we will report uncertainty measures for all fit parameters.

      Comments on the revised manuscript:

      The manuscript has been revised and improved significantly by the addition of methodological details and new analysis. I remain, however, unconvinced by the argument that increased vigilance in the presence of reward leads to heightened escape behaviour.

      In response to my criticism that the work does not measure vigilance directly, the authors have included measures of foraging interval and foraging speed, which they state are "two direct behavioral analyses of vigilance". I disagree - like reaction time, foraging speed and foraging interval can be modulated, for example, by changes in threat sensitivity. Increased threat sensitivity comes with diverse behavioral changes that may well include increased vigilance, but foraging interval and foraging speed can certainly change without the animal expressing increased vigilance behaviors. A bigger issue I still have, though, is with the conclusion that the presence of reward increases "direct escape behaviors". Comparing the no reward, water and sucrose groups indeed shows a difference (which is now clear after the split into early and late phases), but the issue is that these are different mice. As the text is written, it sounds like introducing reward will acutely increase escape. But if we look at the raw data shown in Figure 2C, what I think is happening is that the presence of reward is decreasing habituation to the stimulus. The data for trials 1 and 10 in the three conditions show this - there is habituation with no reward (reaction times are all shifting to the right), a bit less with water and very little with sucrose. This is interesting in its own right and we can speculate why it might be happening, but I think this is conceptually different from what the authors are proposing.

      We agree that vigilance is not directly observable as a single variable. Our intent was not to claim that foraging speed and foraging interval provide a direct measure of vigilance, but rather to suggest that they may serve as indirect behavioral correlates.

      We also considered an alternative interpretation: these two measures could reflect perceived reward value under high-threat conditions across distinct reward types. If that were the case, animals would be expected to exhibit shorter intervals and faster speeds across no reward, water, and sucrose conditions. However, our data do not support this interpretation (Figures 3L and 3M), suggesting that these measures are more likely correlated with vigilance. 

      Furthermore, it is unlikely that changes in foraging interval and speed are driven by altered threat sensitivity, as animals could not see the threat during most of the foraging bout and only encountered it at the end.

      Regarding the conclusion that the presence of reward increases direct escape behaviors, our interpretation is that increased reward value reduces habituation, thereby maintaining higher vigilance during the late phase. This was discussed in the second-to-last paragraph of the "Economic and social modulations of innate decision-making under threat" subsection in the Discussion.

      Reviewer #3 (Public review):

      Male mice were tested in a classic behavioral "flee the looming stimulus" paradigm. This is a purely behavioral study; no neural analyses were done. Mice were housed socially, but faced the looming stimulus individually, using an elegant automated tunnel (see videos for clarity).

      The additional changes made to the paper clarify the work done. While there are some limitations (male mice, weird stimulus), the general results are interesting and a valuable addition to the experimental literature. The main claim of the paper is that the different rewards (none, water, sucrose) did not change the escape properties early in learning, but did late, particularly that in the late (already experienced) conditions, reward value (assuming sucrose > water > no reward) interacted with the salience of the looming stimulus (light gray, dark gray). (Panels 3D, 3G, 3K, 3N).

      For readers, I want to note that one of the most interesting results is actually in Figure S2, where they find that a looming stimulus behind the mouse still makes a mouse run to the nest. In these conditions, the mouse runs past the looming stimulus to get to safety! (I also do love the video of the mouse running around the barriers like a snake to get home.)

      I have a few minor clarification questions and a few notes that I think would be useful additions for authors and readers to think about.

      Dominance: What does the mouse social science literature say about the "test tube" test? What can we conclude from this test? This would be useful when trying to understand what is causing the dominance/submissive difference in responses. Figure 4 shows that the dominant mice are more risk-averse than the submissive mice. Is "dominance" in the test-tube actually a measure of risk-seeking? Is the issue that the submissive mice don't think they can get back to the food-site easily, so they are less willing to sacrifice the current (if dangerous) foraging opportunity? Is the issue that the submissive mice can't get back to the nest? As I understand it, the nest was always available to all the mice, so I suspect inability to get to the nest is an unlikely hypothesis. Is the issue that the submissive mice also don't feel safe in the nest?

      The tube test is a widely used assay in the rodent social behavior literature to assess dominance hierarchies, operationally defined by the ability of one animal to force its opponent to retreat from a narrow tube. Importantly, this assay does not directly measure risk-seeking or anxiety-related traits, but rather competitive outcomes during social conflict. Furthermore, our data indicate that the behavioral responses of subordinate mice to looming stimuli are primarily driven by the visual threat itself rather than by social avoidance. This point was elaborated in the second paragraph of the “Social modulation of innate decision-making” subsection in the Results section.

      Limitations of the study: There is an acknowledged limitation to male mice, and the limitations of the small data sets that are typical of such experiments. In addition, however, it is also worth noting the strangeness of the looming stimulus, which is revealed clearly in the videos. The stimulus is a repeating growing circle, growing in a single location within the environment. The stimulus repeats 10 times, once per second. This is not what an attacking hawk or owl would look like. (I now have this image of an owl diving down, and then teleporting up and diving down again.) Note - I am fine with this stimulus. It produces an interesting experiment and interesting results. I do not think the authors need to change anything in their paper, but readers need to recognize that this is not a "looming predator".

      These "limitations" are better seen as "caveats" when folding these results in with the rest of the literature that has gone before and the literature to come. (Generally, I do not believe that science works by studies making discoveries that change how we think about problems - instead, science works by studies adding to the literature that we integrate in with the rest of the literature.) Thus, these caveats should not be taken as problems with the study or as fixes that need to be done. Instead, they are notes for future researchers to notice if differences are found in any future studies.

      Thus, my only suggestion is that I think authors could write a more careful paper by using the past and subjunctive tense appropriately. Experimental observations should be in past tense, as in "the influence of reward was context-dependent and emerged in the late phase" instead of "the influence of reward is context-dependent and emerges in the late phase" - it emerged in the late phase this once - it might not in future experiments, not due to any fault in this experiment nor due to replicability problems, but rather due to unexpected differences between this and those future experiments. At which point, it will be up to those future experiments to determine the difference. Similarly, large conclusions should be in the subjunctive tense, as in "these data suggest that threat intensity is likely to be the primary determinant of decision making" rather than "threat intensity is the primary determinant of decision making", because those are hypotheses not facts.

      We thank the reviewer for the helpful suggestions and have revised the Abstract accordingly.


      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      This study investigates how mice make defensive decisions when exposed to visual threats and how those decisions are influenced by reward value and social hierarchy. Using a naturalistic foraging setup and looming stimuli, the authors show that higher threat leads to faster escape, while lower threat allows mice to weigh reward value. Dominant mice behave more cautiously, showing higher vigilance. The behavioral findings are further supported by a computational model aimed at capturing how different factors shape decisions.

      Strengths:

      (1) The behavioral paradigm is well-designed and ethologically relevant, capturing instinctive responses in a controlled setting.

      (2) The paper addresses an important question: how defensive behaviors are influenced by social and value-based factors.

      (3) The classification of behavioral responses using machine learning is a solid methodological choice that improves reproducibility.

      Weaknesses:

      (1) Key parts of the methods are hard to follow, especially how trials are selected and whether learning across trials is fully controlled for. For example, it is unclear whether animals are in the nest during the looming stimulus presentations. The main text and methods should clarify whether multiple mice are in the nest simultaneously and whether only one mouse is in the arena during looming exposure. From the description, it seems that all mice may be freely exploring during some phases, but only one is allowed in the arena at a time during stimulus presentation. This point is important for understanding the social context and potential interactions, and should be clearly explained in both the main text and methods.

      We agree that these details are essential and have clarified them in the Methods. When the door system operated normally, only one mouse was allowed in the arena during looming exposure. Specifically, when all mice were in the nest, the nest-tunnel door was open and the tunnel-arena door was closed. Once a single mouse entered the tunnel, as detected by an OpenMV camera, the nest-tunnel door closed and the tunnel-arena door opened, ensuring that only that mouse could enter the arena.

      Habituation was conducted over two days. On day 1, five mice were placed together in the nest for 30 minutes with all doors closed. Each mouse was then placed individually in the nest and allowed to freely explore the arena for 10 minutes under normal door operation. Finally, all mice were returned to the nest with all doors open and allowed to explore freely for 2 hours. On day 2, each mouse was placed individually in the nest and given an additional 1 hour of exploration under normal door operation.

      (2) It is often unclear whether the data shown (especially in the main summary figures) come from the first trial or are averages across several exposures. When is the cut-off for trials of each animal? How do we know how many trial presentations were considered, and how learning at different rates between individuals is taken into account when plotting all animals together? This is important because the looming stimulus is learned to be harmless very quickly, so the trial number strongly affects interpretation.

      We observed substantial inter-individual variability in habituation to looming stimuli, with a sharp decline in defensive responses over the first few trials followed by more gradual changes. To account for this, we segmented trials for each animal into two phases: an early rapid-habituation phase and a later stable phase. Analyzing these phases separately revealed that threat intensity dominates behavior in the early phase, whereas both threat and reward significantly influence behavior in the late phase. These results are now presented in revised Figures 2 and 3. Analyses restricted to first trials are included in Figure S5.
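
      To make the idea of per-animal segmentation concrete, here is a minimal sketch of one possible split rule. The response above does not state the criterion actually used, so the least-squares two-segment fit, the function name `split_early_late`, and the `min_len` parameter are all assumptions for illustration, not the authors' method.

```python
import numpy as np

def split_early_late(response, min_len=2):
    """Split one animal's trial series into an early and a late phase.

    ASSUMPTION: the split point is the index k that minimises the summed
    squared deviation from the mean within trials [0:k] and trials [k:].
    This is a generic piecewise-constant changepoint rule, used here only
    as a stand-in for whatever criterion the study applied.
    """
    y = np.asarray(response, dtype=float)
    best_k, best_cost = None, np.inf
    for k in range(min_len, len(y) - min_len + 1):
        early, late = y[:k], y[k:]
        cost = np.sum((early - early.mean()) ** 2) + np.sum((late - late.mean()) ** 2)
        if cost < best_cost:
            best_k, best_cost = k, cost
    return best_k  # None when the series is too short to split

# Hypothetical per-trial response scores for one mouse (not data from the study).
scores = [1.0, 0.9, 0.8, 0.2, 0.2, 0.2, 0.2, 0.2]
k = split_early_late(scores)
early_trials, late_trials = scores[:k], scores[k:]
```

      Because the split index is computed separately for each animal, mice that habituate at different rates contribute different numbers of trials to each phase.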

      (3) The reward-related effects are difficult to interpret without a clearer separation of learning vs first responses.

      As noted above, we have re-analyzed our data to account for learning effects.

      (4) The model reproduces observed patterns but adds limited explanatory or predictive power. It does not integrate major findings like social hierarchy. Its impact would be greatly improved if the authors used it to predict outcomes under novel or intermediate conditions.

      We have substantially revised the modeling analysis. The model is now fitted to behavioral data from the late phase and used to predict outcomes across additional conditions, including the early phase behavior and rank-dependent behavioral differences. The model successfully captures behavioral patterns across these conditions, supporting its predictive value beyond descriptive fitting.

      (5) Some conclusions (e.g., about vigilance increasing with reward) are counterintuitive and need stronger support or alternative explanations. Regarding the interpretation of social differences in area coverage, it's also possible that the observed behavioral differences reflect access to the nesting space. Dominant mice may control the nest, forcing subordinates to remain in the open arena even during or after looming stimuli. In this case, subordinates may be choosing between the threat of the dominant mouse and the external visual threat. The current data do not distinguish between these possibilities, and the authors do not provide evidence to support one interpretation over the other. Including this alternative explanation or providing data that addresses it would strengthen the conclusions.

      To support the interpretation of increased vigilance with reward under high-threat conditions, we analyzed additional behavioral measures beyond latency to flee. Rewarded mice showed longer foraging interval and slower foraging speed, both consistent with elevated vigilance (Figures 3L and 3M).

      To address the alternative explanation that subordinate mice may remain in the arena due to restricted nest access, we compared arena occupancy before, during, and after looming exposure. Although subordinates spent more time in the arena before looming, this difference disappeared during and after looming exposure (Figure 4C). Moreover, dominant and subordinate mice were equally likely to flee to the nest during escape trials. These findings rule out nest access restrictions as an explanation for the observed rank-dependent differences in defensive behaviors.

      (6) While potential neural circuits are mentioned in the discussion, an earlier introduction of candidate brain regions and their relevance to threat and value processing would help ground the study in existing systems neuroscience.

      We have revised the Introduction to incorporate relevant brain regions and neural circuits.

      (7) Some figures are difficult to interpret without clearer trial/mouse labeling, and a few claims in the text are stronger than what the data fully support. Figure 3H is done for low contrast, but the interesting finding would come from doing this experiment with high contrast. Figure 4H - I don't understand this part. If the amount of time in the center after the loom changes for subordinate mice, how does this lead to the conclusion that they spend most of their time in the reward zone? Figure 3A - The example shown does not seem representative of the claim that high contrast stimuli are more likely to trigger escape. In particular, the 10% sucrose condition appears to show more arena visits under low contrast than high contrast, which seems to contradict that interpretation. Also, the plot currently uses trials on the Y-axis, but it would be more informative to show one line per animal, using only the first trial for each. This would help separate initial threat responses from learning effects and clarify individual variability.

      We have substantially revised the figures. Results from trial segmentation based on individual habituation are now explicitly presented in Figures 2 and 3, and analyses using only the first trials are provided in Figure S5 to separate initial responses from learning effects.

      Regarding the original Figure 4H, we are not entirely certain about the concern. In this panel, we measured time spent in the reward zone, which is defined as the region within 10 cm of the reward port at the end of the arena, not the center of the arena, during looming exposure. Subordinate mice spent significantly more time in the reward zone than dominant mice. We have further clarified this in the revised manuscript.

      (8) The analysis does not explore individual variability in behavior, which could be an important source of structure in the data. Without this, it is difficult to know whether social hierarchy alone explains behavioral differences or if other stable traits (e.g., anxiety level, prior experiences) also contribute.

      We observed substantial individual variability in both dominant and subordinate mice, even on the first trial (Figure S7). Paired dominant–subordinate comparisons were used to isolate rank-dependent effects.

      (9) The study shows robust looming responses in group-housed animals, which contrasts with other studies that often require single housing to elicit reliable defensive responses. It would be valuable for the authors to discuss why their results differ in this regard and whether housing conditions might interact with social rank or habituation.

      Robust looming-evoked defensive responses have been reported in both group- and single-housed mice (Yilmaz and Meister, 2013; Lenzi et al., 2022), although single-housed mice habituate more rapidly. We have now discussed the potential interactions between housing conditions, social rank, and habituation in defensive behaviors in the revised manuscript.

      Reviewer #2 (Public review):

      Zhe Li and colleagues investigate how mice exposed to visual threats and rewards balance their decisions in favour of consuming rewards or engaging in defensive actions. By varying threat intensity and reward value, they first confirm previous findings showing that defensive responses increase with threat intensity and that there is habituation to the threat stimulus. They then find that water-deprived mice have a reduced probability of escaping from low contrast visual looming stimuli when water or sucrose are offered in the environment, but that when the stimulus contrast is high, the presence of sucrose or water increases the probability of escape. By analysing behaviour metrics such as the latency to flee from the threat stimulus, they suggest that this increase in threat sensitivity is due to increased vigilance. Analysis of this behaviour as a function of social hierarchy shows that dominant mice have higher threat sensitivity, which is also interpreted as being due to increased vigilance. These results are captured by a drift diffusion model variant that incorporates threat intensity and reward value.

      The main contribution of this work is to quantify how the presence of water or sucrose in water-deprived mice affects escape behaviour. The differential effects of reward between the low and high contrast conditions are intriguing, but I find the interpretation that vigilance plays a major role in this process is not supported by the data. The idea that reward value exerts some form of graded modulation of the escape response is also not supported by the data. In addition, there is very limited methodological information, which makes assessing the quality of some of the analyses difficult, and there is no quantification of the quality of the model fits.

      (1) The main measure of vigilance in this work is reaction time. While reaction time can indeed be affected by vigilance, reaction times can vary as a function of many variables, and be different for the same level of vigilance. For example, a primate performing the random dot motion task exhibits differences in reaction times that can be explained entirely by the stimulus strength. Reaction time is therefore not a sound measure of vigilance, and if a goal of this work is to investigate this parameter, then it should be measured. There is some attempt at doing this for a subset of the data in Figure 3H, by looking at differences in the action of monitoring the visual field (presumably a rearing motion, though this is not described) between the first and second trials in the presence of sucrose. I find this an extremely contrived measure. What is the rationale for analysing only the difference between the first and second trials? Also, the results are only statistically significant because the first trial in the sucrose condition happens to have zero up action bouts, in contrast to all other conditions. I am afraid that the statistics are not solid here. When analysing the effects of dominance, a vigilance metric is the time spent in the reward zone. Why is this a measure of vigilance? More generally, measuring vigilance of threats in mice requires monitoring the position of the eyes, which previous work has shown is biased to the upper visual field, consistent with the threat ecology of rodents.

      We agree that reaction time can be influenced by multiple factors, including stimulus strength. Consistent with this, reaction times (i.e. latencies to flee) were substantially shorter under high-contrast conditions. However, even under the same high-contrast condition, reaction times were significantly shorter in the reward conditions compared to the no-reward condition, suggesting that other factors such as vigilance may contribute.

      Regarding the measurement of vigilance, in addition to the latency to flee, we analyzed two additional behavioral measures related to vigilance. First, we examined the foraging interval. Our hypothesis was that more vigilant animals would wait longer before re-entering the reward zone following threat exposure. Consistent with this prediction, mice under sucrose and water reward conditions showed significantly longer foraging intervals than those under no-reward conditions (Figure 3L). Second, we analyzed the foraging speed as mice approached the reward. Increased vigilance should lead to more cautious and therefore slower movements. Our results support this, as mice moved more slowly towards the reward under sucrose conditions (Figure 3M). Taken together, these three measures consistently indicate that mice exhibit increased vigilance under sucrose reward in high-threat conditions.

      (2) In both low and high contrast conditions, there are differences in escape behaviour between no reward and water or sucrose presence, but no statistically significant differences between water and sucrose (e.g., Figure 3B). I therefore find that statements about reward value are not supported by the data, which only show differences between the presence or absence of reward. Furthermore, there is a confound in these experiments, because according to the methods, mice in the no-reward condition were not water-deprived. It is thus possible that the differences in behaviour arise from differences in the underlying state.

      Our new analysis, which segments behavior into an early adaptive phase and a late stable phase, reveals a statistically significant difference between water and sucrose rewards in the late phase (Figure 3H), supporting a graded effect of reward value.

      To control for the potential confounds related to internal state, mice were not water-deprived in any of the reward conditions. We have clarified this in the revised manuscript.

      (3) There is very little methodological information on behavioural quantification. For example, what is hiding latency? Is this the same as reaction time? Time to reach the safe zone? What exactly is distance fled? I don't understand how this can vary between 20 and 100cm. Presumably, the 20cm flights don't reach the safe place, since the threat is roughly at the same location for each trial? How is the end of a flight determined? How is duration measured in reward zone measures, e.g., from when to when? How is fleeing onset determined?

      Hiding latency was defined as the time from stimulus onset to the animal’s arrival at the safe zone. Reaction time was quantified as the latency to flee, measured from stimulus onset to the initiation of the first flight state. The flight state was defined as locomotion exceeding 10 cm at a speed greater than 10 cm/s. Distance fled was defined as the distance covered between stimulus onset and offset for all trials. However, in trials classified as no reaction or freezing, this measure does not accurately reflect escape behavior. We will therefore rename it as distance under threat to better capture its meaning. The reward zone was defined as the region within 10 cm of the reward port at the end of the arena. Duration in the reward zone was measured as the time spent within this region during the 20 seconds following stimulus onset. In Figure 4E, the percentage of time spent in the reward zone was calculated relative to the total time the mouse remained in the arena during the 2-hour social session.
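
      The definitions above can be sketched in code as follows. This is one reading of them, not the authors' analysis pipeline: it assumes tracked 2-D positions in cm sampled at a fixed frame rate, and it treats a flight state as a continuous run of frames faster than 10 cm/s whose summed path length exceeds 10 cm. The function names, argument names, and the example trajectory are hypothetical.

```python
import numpy as np

def latency_to_flee(xy, fps, onset_frame, min_speed=10.0, min_dist=10.0):
    """Seconds from stimulus onset to the start of the first flight state.

    ASSUMPTION: a flight state is a contiguous run of frames with speed
    above `min_speed` (cm/s) that covers more than `min_dist` (cm).
    Returns None if no such run starts after stimulus onset.
    """
    xy = np.asarray(xy, dtype=float)
    step = np.linalg.norm(np.diff(xy, axis=0), axis=1)  # cm moved per frame
    speed = step * fps                                   # cm/s
    start, travelled = None, 0.0
    for i in range(onset_frame, len(step)):
        if speed[i] > min_speed:
            if start is None:
                start, travelled = i, 0.0   # a candidate bout begins here
            travelled += step[i]
            if travelled > min_dist:
                return (start - onset_frame) / fps
        else:
            start = None                    # bout interrupted, start over
    return None

def time_in_reward_zone(xy, fps, onset_frame, port_xy, radius=10.0, window_s=20.0):
    """Seconds spent within `radius` cm of the reward port during the
    `window_s` seconds that follow stimulus onset."""
    xy = np.asarray(xy, dtype=float)
    end = min(len(xy), onset_frame + int(round(window_s * fps)))
    dist = np.linalg.norm(xy[onset_frame:end] - np.asarray(port_xy, dtype=float), axis=1)
    return np.count_nonzero(dist <= radius) / fps

# Hypothetical trial at 10 frames/s: the mouse sits at the port for 2 s,
# then runs away at 20 cm/s. The stimulus starts at frame 10.
trial = np.array([[0.0, 0.0]] * 20 + [[2.0 * k, 0.0] for k in range(1, 11)])
latency = latency_to_flee(trial, fps=10, onset_frame=10)
zone_time = time_in_reward_zone(trial, fps=10, onset_frame=10, port_xy=(0.0, 0.0))
```

      Under this reading, a trial with no qualifying run returns no latency at all, which is why a distance measure computed over every trial also includes no-reaction and freezing trials.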

      All definitions and additional details on behavioral quantification have been included in the revised Methods section.

      (4) There is little methodological information on how the model was fit (for example, it is surprising that in the no reward condition, the r parameter is exactly 0. Was this constrained in any way?), and none of the fit parameters have uncertainty measures, so it is not possible to assess whether there are actually any differences in parameters that are statistically significant.

      We have provided a detailed description of the model fitting procedure in the revised Methods section. Specifically, the reward-value parameter (r) was constrained to zero in the no-reward condition. We have plotted how the overall loss varies with different parameters (Figure S9).
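
      For readers unfamiliar with this model class, the sketch below simulates a generic drift-diffusion process with a leak term and reports when it first reaches a decision threshold. It is a textbook form, not the fitted model from the manuscript: these reviews do not specify how threat intensity and the reward-value parameter r enter the drift, so the sketch takes a single `drift` argument, and every name and value in it is an illustrative assumption.

```python
import numpy as np

def first_passage_time(drift, leak, threshold, noise=0.0, dt=0.001, t_max=10.0, seed=0):
    """Euler-Maruyama simulation of dx = (drift - leak * x) dt + noise dW.

    Returns the first time (in seconds) at which x reaches `threshold`,
    or None if that does not happen within `t_max` seconds.
    ASSUMPTION: x starts at 0 and there is a single upper threshold.
    """
    rng = np.random.default_rng(seed)
    x = 0.0
    n_steps = int(round(t_max / dt))
    for step in range(1, n_steps + 1):
        x += (drift - leak * x) * dt + noise * np.sqrt(dt) * rng.standard_normal()
        if x >= threshold:
            return step * dt
    return None

# Without noise, the leak caps the accumulator at drift / leak, so a weak
# input never reaches the threshold while a strong one does.
weak = first_passage_time(drift=1.0, leak=2.0, threshold=1.0)
strong = first_passage_time(drift=4.0, leak=2.0, threshold=1.0)
```

      The leak matters for interpretation: with leak, a weak input can fail to trigger a response however long it lasts, whereas a standard drift-diffusion process without leak would eventually cross the threshold.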

      Reviewer #3 (Public review):

      Male mice were tested in a classic behavioral "flee the looming stimulus" paradigm. This is a purely behavioral study; no neural analyses were done. Mice were housed socially, but faced the looming stimulus individually. Drift-diffusion modeling found that reward-level interacted with threat level such that at low-threat levels, reward contrasted with threat as classically expected (high reward overwhelms low threat, low threat overwhelms low reward), but that reward aligned with threat at higher threat levels.

      Note that they define threat level by the darkness of the looming stimulus. I am not sure that darker stimuli are more threatening to mice. But maybe. Figure 3 shows that mice react more quickly to high-contrast looming stimuli, but can the authors distinguish between the ability to detect the visual signal and considering it a more dangerous threat? (The fact that vigilance makes a difference in the high-contrast condition, not the low-contrast condition, actually supports the authors' hypotheses here.)

      Regarding the interpretation of stimulus contrast as a proxy for threat level, we agree it is crucial to distinguish improved detection from heightened threat perception. To address this, we examined not only latency to flee but also escape distance and peak escape speed, two measures that reflect the intensity of the defensive response. If contrast only influenced detection, we would expect differences in latency but not in escape distance or speed. All three measures differed significantly across contrast conditions, supporting the interpretation that high-contrast stimuli are perceived as more threatening rather than simply more detectable. Furthermore, manual review of "no response" trials confirmed reliable detection in both conditions, with only three potential "missed" trials out of 117 under low contrast (Figure S3B). We have included this discussion in the revised manuscript.

      The drift-diffusion model (DDM) is fine. I note that the authors included a "leakage rate", which is not a standard DDM parameter (although I like including it). I would have liked to see more about the parameters. What were the distributions? What did the parameters correlate with behaviorally? I would have liked to see distributions of the parameters under the different conditions and different animals. Figure 2C shows the progression of learning. How do the fit parameters change over time as mice shift from choice to choice? How do the parameters change over mice? How do the parameters change over distance to the threat/distance to safety (as per Fanselow and Lester 1988)? They did a supplemental experiment where the threat arrived halfway along the corridor - we could get a lot more detail about that experiment - how did it change the modeling?

      Because our model is fit to the variance of latency distributions, it cannot be applied to single-trial data. Instead, we analyzed how decisions and latencies vary as functions of the fitted threat gain and reward value parameters (Figures 5G and 5H). We have also introduced a simplified deterministic model to further elucidate the decision-making process.
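For readers unfamiliar with the model class under discussion, a drift-diffusion process with a leak term can be sketched as below. This is a generic textbook-style illustration and not the authors' model: the update rule, the parameter names (`drift`, `leak`, `noise`, `bound`), and all values are assumptions made for the example.

```python
import math
import random


def simulate_leaky_ddm(drift, leak, noise, bound,
                       dt=0.001, t_max=5.0, rng=None):
    """Simulate one trial of a leaky drift-diffusion process.

    The decision variable x starts at 0 and evolves as
        dx = (drift - leak * x) dt + noise dW
    using Euler-Maruyama steps. Returns the first time x reaches `bound`
    (the simulated latency), or None if the bound is not reached by t_max.
    """
    rng = rng or random.Random()
    x, t = 0.0, 0.0
    sqrt_dt = math.sqrt(dt)
    while t < t_max:
        x += (drift - leak * x) * dt + noise * sqrt_dt * rng.gauss(0.0, 1.0)
        t += dt
        if x >= bound:
            return t
    return None
```

In this sketch the leak makes the decision variable saturate at `drift / leak` when there is no noise, so a weak input can fail to reach the bound at all. Fitting such a model to latency distributions, as described above, would require many simulated trials per condition; that step is not shown here.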

      Regarding the influence of distance to the threat, we conducted additional experiments, presenting the looming stimulus at the end of the arena when the mouse was at different distances from it (Figures S2C–G). We found that as the prey-threat distance increased, mice showed less direct escape behavior, with longer latencies to flee and slower escape speeds. This is consistent with the predatory imminence continuum theory (Fanselow and Lester, 1988), which describes graded defensive behaviors tuned to perceived threat level.

      Regarding the influence of distance to safety, our data indicate that it did not significantly affect defensive responses (Figures S2H and S2I). To test this further, we introduced barriers that lengthened the return path to the safe zone. We found that defensive decisions were not correlated with the distance to the safe zone (Figures S2J and S2K), suggesting that once a threat is detected, animals prioritize escape initiation over evaluating the exact path to safety.

      Overall, this is a reasonable study showing mostly unsurprising results. I think the authors could do more to connect the vigilance question to their results (which seems somewhat new to me).

      We have expanded our analysis of vigilance. In addition to escape latency, we examined the foraging interval and foraging speed. We hypothesized that more vigilant animals would wait longer before re-entering the reward zone following a threat and would approach the reward more slowly. Consistent with this prediction, mice in the sucrose- and water-reward conditions exhibited significantly longer foraging intervals and slower foraging speeds compared to those in the no-reward condition (Figures 3M and 3N). Together, these three measures consistently demonstrate that mice display heightened vigilance under high-threat, high-reward conditions.

      Although the data appear generally fine and the modeling reasonable, the authors do not do the necessary work to set themselves within the extensive literature on decision-making in mice retreating from threats.

      First of all, this is not a new paradigm; variants of this paradigm have been used since at least the 1980s. There is an *extensive* literature on this, including extensive theoretical work on the relation of fear and other motivational factors. I recommend starting with the classic Fanselow and Lester 1988 paper (which they cite, but only in passing), and the reviews by Dean Mobbs and Jeansok Kim, and by Denis Paré and Greg Quirk, which have explicit theoretical proposals that the authors can compare their results to. I would also recommend that the authors look into the "active avoidance" literature. Moreover, to talk about a mouse running from a looming stimulus without addressing the other "flee the predator" tasks is to miss a huge space for understanding their results. Again, I would start with the reviews above, but also strongly urge the authors to look at the Robogator task (work by June-Seek Choi and Jeansok Kim, work by Denis Paré, and others).

      Similarly, in their anatomical review, they do not mention the amygdala. Given the extensive literature on the role of the amygdala in retreating from danger, both in terms of active avoidance and in terms of encoding the danger itself, it would surprise me greatly if this behavior does not involve amygdala processing. (If there is evidence that the amygdala does not play a role here, but that the superior colliculus does, then that would be a *very* important result that needs to be folded into our understanding of decision-making systems and neural computational processing.)

      Second, there is an extensive economic literature on non-human animals in general and on rodents in particular. Again, the authors seem unaware of this work, which would provide them with important data and theories to broaden the impact of their results (by placing them within the literature). First, there are explicit economic literatures in terms of positively-valenced conflicts (e.g., neuroeconomics within the primate literature, sequential foraging and delay-discounting tasks within the rodent literature), but also there is a long history within the rodent conditioning world, such as the classic work by Len Green and Peter Shizgal. I would strongly urge the authors to explore the motivational conflict literature by people like Gavin McNally, Greg Quirk, and Mark Andermann. Again, putting their results into this literature will increase the impact of their experiment and modeling.

      We have substantially revised the manuscript to contextualize our findings within the extensive literature on defensive behavior and decision-making. The revised Introduction and Discussion now integrate key theoretical frameworks, such as the predatory imminence continuum, and cite relevant work on active avoidance and other "flee the predator" paradigms (e.g., the Robogator task).

      We have also incorporated perspectives from neuroeconomics and motivational conflict, including literature on sequential foraging, delay-discounting tasks, and relevant rodent studies. Furthermore, we now discuss the potential contributions of specific brain regions, including the superior colliculus and the amygdala, to the economic and social modulation of innate defensive decisions in response to visual threats.

      Recommendations for the authors:

      Reviewing Editor Comments:

      These additional recommendations are generally consistent and overlapping across reviewers, particularly Reviewers #1 and #2, so it is advisable to undertake these changes/additions.

      Reviewer #1 (Recommendations for the authors):

      (1) Experimental methods and trial structure need clarification: It is often unclear how many trials were included per condition, per mouse, and whether the key behavioral effects (especially reward-related changes) were observed early in the session or after repeated stimulus exposure. For example, in several reward-related plots (e.g., Figure 3), it is not specified whether results are driven by early or later trials. Since the authors themselves report rapid learning of the looming stimulus (habituation), it is critical to state how many trials were included in each comparison, and to analyze whether effects hold on the first exposure and not the rest. Otherwise, conclusions about value-based behavior are hard to separate from learning effects, which may also differ between individuals. Specifically, the methods section is vague and hard to follow.

      We have substantially expanded the Methods section with additional details to improve clarity.

      To account for individual variability in habituation to the looming stimulus, we segmented trials for each animal into early and late phases. We demonstrate that threat level is the dominant factor driving behavioral responses in the early phase, while both threat level and reward condition shape behavior in the late phase. We have substantially revised Figures 2 and 3 to reflect these changes.

      (2) Add a summary of experimental design: A table or schematic summarizing the trial structure, experimental groups, reward/threat conditions, and the timeline of exposures would greatly improve clarity.

      We have added a schematic to Figure 2 summarizing the trial structure, experimental groups, reward and threat conditions, and the overall timeline.

      (3) Replot key results using only the first trial per mouse: This would allow readers to assess the first (not learned) responses and help control for habituation/suppression.

      We have replotted behavioral results using only the first trial from each mouse and included these analyses in Figure S5. These results confirm that threat level is the dominant factor driving the initial response to looming stimuli.

      (4) The model needs stronger justification and predictive value: As it stands, the model primarily fits the existing data and does not offer new insights beyond what is already evident from the behavioral results.

      Important findings, such as social hierarchy effects and habituation dynamics, are not captured in the model, reducing its relevance to the full dataset.

      The drift-diffusion framework is widely used, and in this implementation appears to have been adjusted post hoc to fit the observed data rather than generating new conceptual advances. No comparison with simpler models is included. Without testing simpler or alternative models, it is not clear whether the added complexity is necessary or justified.

      Use the model to generate and test predictions: to increase the model's contribution, the authors could simulate new conditions. Suggested experiments include:

      a) Predicting escape probability and latency at intermediate threat intensities to test whether behavior shifts gradually or abruptly.

      b) Using the model's habituation parameters to predict changes in escape behavior over repeated exposures.

      c) Adjusting vigilance or threat gain parameters to simulate dominant versus subordinate animals, and comparing model predictions to actual behavioral differences based on social rank.

      We have substantially revised the modeling section to address these concerns. The updated model is now fitted to behavioral data from the late phase of the reward–threat experiments and used to generate predictions for the early phase and for rank-dependent behavioral differences.

      The model accurately captures behavioral patterns across these conditions, demonstrating predictive power beyond descriptive fitting. Accordingly, we have removed the habituation component. Furthermore, we have introduced a simplified deterministic model in the revised manuscript to further understand the decision-making process.

      (5) Clarify housing and arena access conditions: It is unclear from the text whether all mice are in the nest during looming presentations and whether only one mouse is in the arena during the stimulus. This is important for understanding the social context of each trial and should be explained in the main text and methods.

      We have clarified this point in the Methods section. Under normal door operation, only one mouse was allowed in the arena during looming exposure. Specifically, when all mice were in the nest, the nest-tunnel door was open and the tunnel-arena door was closed. Once a single mouse entered the tunnel, as detected by an OpenMV camera, the nest-tunnel door closed and the tunnel-arena door opened, ensuring that only that mouse could enter the arena.
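The door protocol described above amounts to a two-door interlock. A minimal sketch of that logic follows; the class and method names are hypothetical, and the sketch covers only the two states stated in the response (it does not model sensors, timing, or what happens while the mouse is inside the arena).

```python
class ArenaGate:
    """Two-door interlock: the doors are never open at the same time."""

    def __init__(self):
        # Resting state: all mice in the nest.
        self.nest_tunnel_open = True
        self.tunnel_arena_open = False

    def on_single_mouse_in_tunnel(self):
        """Exactly one mouse detected in the tunnel: isolate it, then
        give it access to the arena."""
        self.nest_tunnel_open = False
        self.tunnel_arena_open = True

    def on_all_mice_in_nest(self):
        """All mice back in the nest: return to the resting state."""
        self.nest_tunnel_open = True
        self.tunnel_arena_open = False

    def both_doors_open(self):
        """Safety check; should always be False."""
        return self.nest_tunnel_open and self.tunnel_arena_open
```

Because each event sets both doors together, there is no state in which a second mouse could follow the first through the tunnel into the arena.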

      (6) Alternative interpretation of subordinate behavior: differences in area coverage and time in the reward zone may not reflect reduced vigilance, but rather avoidance of dominant mice. Subordinates may remain in the open arena to avoid conflict. The authors do not provide evidence distinguishing between these interpretations, and this should be addressed.

      To address the alternative explanation that subordinate mice may remain in the arena due to restricted nest access, we compared arena occupancy before, during, and after looming exposure (Figure 4C). Before looming exposure, subordinate mice spent significantly more time in the arena, consistent with the idea that they may perceive a social threat from the dominant mouse in the absence of any external threat. However, this difference disappeared during and after looming exposure. This shift suggests that the presence of an external threat alters the social dynamic, reducing the influence of dominance on nest access.

      To further assess whether dominant mice blocked subordinate access to the nest during threat-driven escapes, we analyzed the fraction of escape trials in which mice returned to the nest (Figure 4D). We found no significant difference between dominant and subordinate mice, indicating that dominant mice did not restrict nest access during these trials. Importantly, rank differences in reward-zone occupancy cannot be explained by nest exclusion, as mice do not need to return to the nest when escaping the threat; they can flee directly to the safe zone. Thus, nest access limitations do not account for the observed rank-dependent patterns.

      We agree with the reviewer that reward-zone occupancy should not be interpreted as reduced vigilance in subordinate mice; instead, it likely reflects higher perceived reward value. The manuscript has been revised accordingly.

      (7) Address why robust looming responses were observed in group-housed mice: previous studies often require single housing to elicit strong defensive responses. The authors should explain why their setup yields robust results in group-housed animals and whether housing conditions may interact with dominance or habituation.

      Looming exposure elicits robust defensive behaviors in both group- and single-housed mice (Yilmaz and Meister, 2013; Lenzi et al., 2022), with single-housed animals habituating more quickly to the stimulus (Lenzi et al., 2022). We have now discussed how housing conditions may interact with social rank and habituation to shape defensive behaviors in the revised manuscript.

      For the social-rank experiments, we intentionally co-housed dominant and subordinate mice to maintain a stable hierarchy. This choice was motivated by two considerations. First, our goal was to investigate how social rank modulates defensive responses under ethologically relevant conditions, where mice naturally live in groups. Single housing would remove this social context. Second, singly housing mice can destabilize or eliminate rank relationships, making it difficult to interpret rank-dependent behavioral differences.

      (8) Add analysis of individual variability: trial-by-trial variability or stable behavioral tendencies in individual animals are not explored. This could explain part of the variation currently attributed to social rank.

      We have analyzed individual variability in both dominant and subordinate mice. We observed substantial variability across all behavioral measurements for each group (Figure S7). To attribute the observed behavioral differences to social hierarchy rather than to other individual traits, we conducted paired comparisons between dominant and subordinate mice (Figure 4).

      (9) Improve figure labeling and readability: some plots are ambiguous in terms of whether rows represent trials or animals. Overlapping points obscure the data in several figures (for example, in Figure 3H, is sucrose n=4?). Consider using jittered scatter plots, boxplots, or individual traces to improve clarity. Also, the Y-axis label in the same figure is missing an 'e'.

      We have revised figures to improve clarity and corrected the typos.

      (10) Avoid overinterpretation of causal explanations: Statements such as "reward increases vigilance due to evolutionary pressure" or that "subordinates are less vigilant" go beyond what the current data can demonstrate and should be rephrased more cautiously.

      We have revised the manuscript to tone down these statements.

      Reviewer #2 (Recommendations for the authors):

      (1) Provide much more extensive methodological details on analyses and model fitting

      We have thoroughly revised the Methods section to provide extensive detail on both behavioral analyses and computational modeling, as outlined in our responses to points (3) and (4) of the Public Review.

      (2) Perform experiments or analyses that directly measure vigilance, if vigilance is to remain as a key explanation for the data.

      As detailed in our response to point (1) of the Public Review, we have supplemented the escape latency measure with two direct behavioral analyses of vigilance: foraging interval and foraging speed. This multi-metric approach robustly supports the interpretation of heightened vigilance.

      (3) Provide extra evidence for an effect of reward value, as opposed to the presence or absence of reward. Control for differences arising from the water deprivation state by performing the no reward condition experiments in water-deprived mice.

      All behavioral data in the reward–threat experiment were collected from normal (non-deprived) mice (Figures 2 and 3), which has been clarified in the revised manuscript. We have reanalyzed the data by segmenting trials into early and late phases for each animal. In the late phase, under low-threat conditions, the effect of reward value is reflected in significant differences between water and sucrose in terms of escape distance and time spent in the reward zone (Figures 3I and 3J). Under high-threat conditions, the reward value effect is reflected in significant differences in latency to flee and peak escape speed (Figures 3K and 3N).

      (4) Using drift rate to describe the "r" variable is confusing because the drift rate of the drift-diffusion process is also determined by the alpha, beta, and h terms.

      We now refer to “r” as the reward value in the revised manuscript.

      Reviewer #3 (Recommendations for the authors):

      (1) I would tone down some of the extreme statements about the problems of previous experiments (such as that most decision-making is on 2AFC). Lots of people do decision-making in serial foraging, fleeing, and other behavioral tasks. The classic Morris water-maze or Barnes-maze are decision-making tasks that aren't 2AFC. Serial foraging tasks, such as the Restaurant Row task, aren't 2AFC. And, actually, lots of mouse behavior tasks are deciding when to stop on a treadmill for a reward. And, for that matter, your task isn't all that "realistic" - mice aren't evolved to flee looming disks, they are evolved to flee hawks and owls. This doesn't invalidate your task at all. I just recommend making it about your work in a positive way rather than others in a negative way.

      We have revised the manuscript to adopt a more positive framing of our work.

      (2) I also don't think there's much use in bringing in crayfish in a mouse task. Spend your time connecting to the other rodent data (mice and rats) instead.

      We agree and have revised the manuscript accordingly, focusing our discussion on relevant rodent literature to provide a more appropriate context for our findings.

      Minor concerns:

      (1) The authors use the term "cognitive control" without making clear what they mean. In general, the authors seem to have a view on decision-making as either being "reflexes" or "cognitive control". This is a very outdated perspective. Modern perspectives include multiple decision-making systems competing, separating these based on their computational properties, such as planning, procedural, instinctual, and, yes, reflexive. Current views on the kinds of behaviors they are discussing generally see fleeing as a transition from reflexive (tonic immobility, freezing) and instinctual responses (freezing, fleeing) to deliberative (anxiety) and procedural (habit). The authors might take a look at the recent Calvin and Redish (2025) paper for some ideas on this.

      We appreciate the reviewer’s insight regarding the term “cognitive control.” In our study, we used this term to emphasize that defensive responses to looming threats are not purely reflexive. Mice exhibit four distinct types of defensive decisions within a short time window, and these decisions are systematically modulated by reward value and social rank. Notably, reward modulation is bidirectional: high reward suppresses defensive responses under low-threat conditions but enhances them under high-threat conditions, indicating that animals integrate multiple sources of information rather than relying solely on instinctive mechanisms.

      We did not observe mid-trajectory aborts in mice, as reported in rats by Calvin & Redish (2025). This difference may reflect species-specific behavior or the nature of the threat: our looming stimulus is purely visual and non-harmful, whereas the robotic predator in their study presents a physical threat. We have revised the Discussion to clarify our use of “cognitive control” and to incorporate these perspectives.

      (2) Only male mice were used. This limits the conclusions that can be drawn.

      We acknowledge the limitation of using only male mice and have discussed this limitation in the revised manuscript.

      (3) Did the authors observe darting behavior? (Gruene...Shansky 2015).

      We did not observe darting behavior, characterized by rapid movement, as reported during inescapable fear conditioning. In our experiment, the mice consistently escaped towards the nest and, in most trials, ran directly to it without stopping. Occasionally, under low-contrast conditions, mice paused once or twice but never moved towards the reward.

      (4) How was only one mouse allowed into the linear arena at a time?

      When all mice were in the nest, the nest-tunnel door was open while the tunnel-arena door remained closed. When a single mouse entered the tunnel, as detected by the RFID and OpenMV camera system, the nest-tunnel door closed and the tunnel-arena door opened, allowing only that mouse to enter the arena. We have clarified this protocol in the Methods section.

      (5) I would like to see more extensive analyses of the animal's responses as a function of distance to the threat (as per Fanselow and Lester 1988).

      As detailed in our response to the public review, we conducted new experiments analyzing behavior as a function of prey–threat distance. The finding that defensive responsiveness decreases with increasing prey–threat distance is now presented in Figures S2C–G and discussed in the context of the predatory imminence continuum.

    1. eLife Assessment

      This is an important study reporting that activation of the presynaptic GPR55 receptor suppresses synaptic transmission by modulating GABA release through the reduction of the readily releasable pool without affecting the presynaptic AP waveform and calcium influx. The evidence supporting this claim is compelling and based on an impressive array of techniques including patch-clamp recordings from the axon terminals of cerebellar Purkinje cells and fluorescent imaging of vesicular exocytosis. While the authors have strengthened their conclusions on several technical fronts in the revised version, further investigation is needed into the mechanism by which GPR55 activation might make vesicles insensitive to the rise in presynaptic [Ca²⁺] mediated by VGCCs, and the nature of the endogenous process that would activate this pathway in vivo.

    2. Reviewer #1 (Public review):

      In this manuscript, the authors report that GPR55 activation in presynaptic terminals of Purkinje cells decreases GABA release at the PC-DCN synapse. The authors use an impressive array of techniques (including highly challenging presynaptic recordings) to show that GPR55 activation reduces the readily releasable pool of vesicles without affecting presynaptic AP waveform and presynaptic Ca²⁺ influx. This is an interesting study, which is seemingly well-executed and proposes a novel mechanism for the control of neurotransmitter release. However, the authors' main conclusions are heavily, if not solely, based on pharmacological agents that more often than not demonstrate affinity at multiple targets. Below are points that the authors should consider in a revised version.

      Major points:

      (1) There is no clear evidence that GPR55 is specifically expressed in presynaptic terminals at the PC-DCN synapse. The authors cited Ryberg 2007 and Wu 2013 in the introduction, mentioning that GPR55 is potentially expressed in PCs. Ryberg (2007) offers no such evidence, and the expression in PC suggested by Wu (2013) does not necessarily correlate with presynaptic expression. The authors should perform additional experiments to demonstrate presynaptic expression of GPR55 at PC-DCN synapse.

      (2) The authors' conclusions rest heavily on pharmacological experiments, with compounds that are sometimes not selective for single targets. Genetic deletion of GPR55 would be a more appropriate control. The authors should also expand their experiments with occlusion experiments, showing if the effects of LPI are absent after AM251 or O-1602 treatment. In addition, the authors may want to consider AM281 as a CB1R antagonist without reported effects at GPR55.

      (3) It is not clear how long the different drugs were applied, and at what time the recordings were performed during or following drug application. It appears that GPR55 agonists can have transient effects (Sylantyev, 2013; Rosenberg, 2023), possibly due to receptor internalization. The timeline of drug application should be reported, where IPSC amplitude is shown as a function of time and drug application windows are illustrated.

      (4) A previous investigation on the role of GPR55 in the control of neurotransmitter release is neither cited nor discussed: Sylantyev et al. (2013, PNAS, Cannabinoid- and lysophosphatidylinositol-sensitive receptor GPR55 boosts neurotransmitter release at central synapses). Similarities and differences should be discussed.

      Minor point:

      (1) What is the source of LPI? What isoform was used? The multiple isoforms of LPI have different affinities for GPR55.

      Comments on revisions:

      In this revised version, the authors have addressed my major concerns. Notably, they used CRISPR/Cas9 genetic knockdown of GPR55 to independently validate their original findings. The main conclusions are now well supported and represent an important contribution to the field.

    3. Reviewer #2 (Public review):

      Summary:

      This paper investigates the mode of action of GPR55, a relatively understudied type of cannabinoid receptor, in presynaptic terminals of Purkinje cells. The authors use demanding techniques of patch-clamp recording of the terminals, sometimes coupled with another recording of the postsynaptic cell. They find a lower release probability of synaptic vesicles after activation of GPR55 receptors, while presynaptic voltage-dependent calcium currents are unaffected. They propose that the size of a specific pool of synaptic vesicles supplying release sites is decreased upon activation of GPR55 receptors.

      Strengths:

      The paper uses cutting-edge techniques to shed light on a little-studied, potentially important type of cannabinoid receptor. The results are clearly presented, and the conclusions are sound.

      Weaknesses:

      The nature of the vesicular pool that is modified following activation of GPR55 is not definitively characterized.

      Comments on revisions:

      The authors have done a good job in answering the criticisms of reviewers. Consequently, the revised version offers a substantial improvement over the first version.

    4. Reviewer #3 (Public review):

      Inoshita and Kawaguchi investigated the effects of GPR55 activation on synaptic transmission in vitro. To address this question, they performed direct patch-clamp recordings from axon terminals of cerebellar Purkinje cells and fluorescent imaging of vesicular exocytosis utilizing synapto-pHluorin. They found that exogenous activation of GPR55 suppresses GABA release at Purkinje cell to deep cerebellar nuclei (PC-DCN) synapses by reducing the readily releasable pool (RRP) of vesicles. This mechanism may also operate at other synapses.

      Strengths:

      The main strength of this study lies in combining patch-clamp recordings from axon terminals with imaging of presynaptic vesicular exocytosis to reveal a novel mechanism by which activation of GPR55 suppresses inhibitory synaptic strength. The results strongly suggest that GPR55 activation reduces the RRP size without altering presynaptic calcium influx.

      Weaknesses:

      The study relies on the exogenous application of GPR55 agonists. It remains unclear whether endogenous ligands released by physiological or pathological processes would have similar effects. There is also little evidence that GPR55 is expressed in Purkinje cell axon boutons. This study would benefit from the use of GPR55 knockout (KO) mice. The downstream mechanism by which GPR55 mediates the suppression of GABA release remains unknown.

      Comments on revisions:

      The authors have addressed all my concerns effectively. I have no further comments and want to commend their comprehensive study.

    5. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      In this manuscript, the authors report that GPR55 activation in presynaptic terminals of Purkinje cells decrease GABA release at the PC-DCN synapse. The authors use an impressive array of techniques (including highly challenging presynaptic recordings) to show that GPR55 activation reduces the readily releasable pool of vesicle without affecting presynaptic AP waveform and presynaptic Ca<sup>2+</sup> influx. This is an interesting study, which is seemingly well-executed and proposes a novel mechanism for the control of neurotransmitter release. However, the authors' main conclusions are heavily, if not solely, based on pharmacological agents that most often than not demonstrate affinity at multiple targets. Below are points that the authors should consider in a revised version.

      We are happy to hear the encouraging comments from this reviewer, and thank them for pointing out the important issues, including the reliance of the previous study design on pharmacological agents alone. To address these, we have performed additional experiments, as detailed below.

      Major points:

      (1) There is no clear evidence that GPR55 is specifically expressed in presynaptic terminals at the PC-DCN synapse. The authors cited Ryberg 2007 and Wu 2013 in the introduction, mentioning that GPR55 is potentially expressed in PCs. Ryberg (2007) offers no such evidence, and the expression in PCs suggested by Wu (2013) does not necessarily correlate with presynaptic expression. The authors should perform additional experiments to demonstrate the presynaptic expression of GPR55 at the PC-DCN synapse.

      We completely agree with the reviewer that our previous manuscript lacked reliable information regarding presynaptic expression of GPR55 at PC boutons.

      To clarify the localization, we first tried immunostaining of GPR55 using commercially available antibodies, but unfortunately they did not provide clear labeling in neurons, or even in GPR55-transfected HEK cells (used as a positive control). Thus, we gave up on direct immunostaining. Instead, we attempted to label PC axonal boutons with a GPR55-targeting dye, together with a complementary strategy based on gene knock-down. Specifically, we used T1117, a fluorescent derivative of AM251 (a GPR55 ligand used in the manuscript), and clear fluorescent signals were evident at GFP-labeled PC terminals. Still, this result alone did not establish whether the labeling was mediated by association with GPR55. Therefore, we also attempted to specifically suppress gene expression of GPR55 using CRISPR/Cas9-mediated genome editing in PCs, based on acute micro-injection of plasmid DNA into the nuclei of PCs to express gRNAs targeting GPR55 together with Cas9. As a result, 5 days after the knock-down, T1117 labeling at axon terminals was reduced by ~50% compared to Cas9-alone controls. All these data are now shown in new Figure 2, and explained in the text p5-6, lines 141-159. Further, the reduction of GPR55 expression abolished the AM251-mediated reduction of vesicular exocytosis, as shown in new Figure 3D, E.

      Taken together, these results substantially support our main conclusions by strongly suggesting that GPR55 is present at PC axon terminals, where it negatively regulates exocytosis upon activation by AM251.

      (2) The authors' conclusions rest heavily on pharmacological experiments, with compounds that are sometimes not selective for single targets. Genetic deletion of GPR55 would be a more appropriate control. The authors should also expand their experiments with occlusion experiments, showing if the effects of LPI are absent after AM251 or O-1602 treatment. In addition, the authors may want to consider AM281 as a CB1R antagonist without reported effects at GPR55.

      We thank the reviewer for pointing out these important issues. First, as noted above, to confirm the presence of GPR55 at axon terminals of PCs, we performed genetic deletion of GPR55 using the CRISPR/Cas9 system. In PCs co-expressing Cas9 and two gRNAs targeting the ligand-binding domain of GPR55, AM251 failed to suppress exocytosis at PC boutons, and T1117 labeling was decreased. Therefore, the idea that GPR55 negatively regulates transmitter release at PC boutons has now been strengthened. The new data are shown in Figure 3D and E, and explained in the text p6, lines 173-178.

      As suggested, we also carried out the occlusion experiments with LPI and AM251. First, LPI reduced the readily releasable pool (RRP) size to a similar extent as AM251. Then, when applied together, LPI and AM251 did not further reduce the RRP size compared with the effect of either compound alone. Thus, LPI and AM251 seem to act through the same pathway, consistent with a role for GPR55 activation. The data are shown in new Figure 5—figure supplement 1 and explained in the text, p7-8, lines 215-221.

      Regarding another point suggested by the reviewer, we applied AM281 and observed no effect on transmission at the PC–target neuron synapses (shown in new Figure 1F and I; explained in the text p5, lines 117-123), indicating that the effect of AM251 is likely to be mediated by GPR55, but not by CB1R.

      Taken together, our additional genetic and pharmacological experiments have consolidated our conclusion that GPR55 suppresses presynaptic neurotransmitter release in PC boutons.

      (3) It is not clear how long the different drugs were applied, and at what time the recordings were performed during or following drug application. It appears that GPR55 agonists can have transient effects (Sylantyev, 2013; Rosenberg, 2023), possibly due to receptor internalization. The timeline of drug application should be reported, where IPSC amplitude is shown as a function of time and drug application windows are illustrated.

      Thank you for suggesting a better presentation of the data. Accordingly, we have re-organized the figures to show the time course of changes in IPSCs before and after the drug application (new Figures 1 and 4; p4, lines 94-97; p5, lines 110-115; p7, lines 193-197). The current data presentation clearly shows that the effect of AM251 becomes evident within a few minutes after application and then reaches a saturated level.

      (4) A previous investigation on the role of GPR55 in the control of neurotransmitter release is neither cited nor discussed (Sylantyev et al., 2013, PNAS, Cannabinoid- and lysophosphatidylinositol-sensitive receptor GPR55 boosts neurotransmitter release at central synapses). Similarities and differences should be discussed.

      We are really sorry for failing to adequately discuss this important work in our previous manuscript, and deeply appreciate the reviewer pointing this out. We have now cited and discussed the work by Sylantyev et al. (2013) in the text (p12, lines 380-389), as follows:

      ‘Pioneering studies clarified an important role of GPR55 in synaptic transmission at hippocampal excitatory synapses, demonstrating presynaptic enhancement of glutamate release presumably by elevating the cytoplasmic residual Ca<sup>2+</sup> via release from intracellular stores (Sylantyev et al., 2013; Rosenberg et al., 2023), in contrast to the suppression of release in our observation. The lack of positive modulation of AP-triggered release through residual Ca<sup>2+</sup> in PC terminals might be due to abundant amount of potent Ca<sup>2+</sup> buffer calbindin (Fierro and Llano, 1996). Indeed, increased vesicular fusion only for the AP-insensitive spontaneous vesicular release (as mIPSCs) was observed upon the IP<sub>3</sub>-mediated Ca<sup>2+</sup> release from internal store (Gomez et al., 2020). Thus, minimal sensitivity of AP-triggered release to residual Ca<sup>2+</sup> in PC boutons would underlie the distinct effects of GPR55 activation at the presynaptic side.’  

      Minor point:

      (1) What is the source of LPI? What isoform was used? The multiple isoforms of LPI have different affinities for GPR55.

      Thank you for letting us know about the lack of important information in the previous manuscript. In our experiments, we used a soybean-derived LPI mixture containing approximately 58% C16:0 and 42% C18:0 or C18:2 species. According to Brenneman et al. (2025), these isoforms show moderate or strong effects in cultured DRG neurons, whereas the C20:4 isoform, reported to promote neuroinflammatory signaling, was contained only at very low levels. We have added this information to the revised manuscript and briefly discussed the influence of different LPI isoforms on the physiological outcomes of GPR55 activation (p5, lines 127-131; p15, lines 493-496).

      Reviewer #2 (Public review):

      Summary:

      This paper investigates the mode of action of GPR55, a relatively understudied type of cannabinoid receptor, in presynaptic terminals of Purkinje cells. The authors use demanding techniques of patch clamp recording of the terminals, sometimes coupled with another recording of the postsynaptic cell. They find a lower release probability of synaptic vesicles after activation of GPR55 receptors, while presynaptic voltage-dependent calcium currents are unaffected. They propose that the size of a specific pool of synaptic vesicles supplying release sites is decreased upon activation of GPR55 receptors.

      Strengths:

      The paper uses cutting-edge techniques to shed light on a little-studied, potentially important type of cannabinoid receptor. The results are clearly presented, and the conclusions are for the most part sound.

      We are very happy to see the positive comments from the reviewer.

      Weaknesses:

      The nature of the vesicular pool that is modified following activation of GPR55 is not definitively characterized.

      We agree with the reviewer that our data cannot fully address the changes in vesicle pools caused by GPR55. As detailed in our responses to the reviewer's comments in ‘Recommendations for the authors’, we have added explanations and discussion in the main text of the revised manuscript.

      Reviewer #3 (Public review):

      Summary:

      Inoshita and Kawaguchi investigated the effects of GPR55 activation on synaptic transmission in vitro. To address this question, they performed direct patch-clamp recordings from axon terminals of cerebellar Purkinje cells and fluorescent imaging of vesicular exocytosis utilizing synapto-pHluorin. They found that exogenous activation of GPR55 suppresses GABA release at Purkinje cell to deep cerebellar nuclei (PC-DCN) synapses by reducing the readily releasable pool (RRP) of vesicles. This mechanism may also operate at other synapses.

      Strengths:

      The main strength of this study lies in combining patch-clamp recordings from axon terminals with imaging of presynaptic vesicular exocytosis to reveal a novel mechanism by which activation of GPR55 suppresses inhibitory synaptic strength. The results strongly suggest that GPR55 activation reduces the RRP size without altering presynaptic calcium influx.

      We thank the reviewer for the encouraging comments on our study.

      Weaknesses:

      The study relies on the exogenous application of GPR55 agonists. It remains unclear whether endogenous ligands released due to physiological or pathological activities would have similar effects. There is no information regarding the time course of the agonist-induced suppression. There is also little evidence that GPR55 is expressed in Purkinje cells. This study would benefit from using GPR55 knockout (KO) mice. The downstream mechanism by which GPR55 mediates the suppression of GABA release remains unknown.

      We thank the reviewer for pointing out all of these important issues. As detailed in the responses to comments in the ‘Recommendations for the authors’ from the reviewers, we have addressed most of these weak points, and also added careful discussion in the text about the open questions to be resolved in future studies.

      Recommendations for the authors:

      Reviewer #2 (Recommendations for the authors):

      This is a high-quality paper that reports novel and interesting results. The authors should consider one main critique, related to Figure 6, as well as a number of minor points.

      We thank the reviewer for the very positive assessment of our study. We have carefully considered the main critique regarding presynaptic vesicle pools (related to previous Figure 6), as well as the other points, and revised the manuscript accordingly.

      Main critique:

      In Figure 6, it is said that GPR55 locks SVs in a state that is insensitive to VGCCs, based on a series of experiments with synapto-pHluorin. This conclusion is open to several critiques:

      The authors' model is shown in the diagram of Figure 6A. In this scheme, it appears as if recycled SVs eventually re-acidify in spite of the presence of bafilomycin, and that they are directed to a location close to the plasma membrane, but away from VGCCs. In fact, there is no evidence that the effects of bafilomycin could be limited in time. And there is a lot of evidence indicating that recycled SVs move back to release sites, close to VGCCs.

      We are sorry for presenting a misleading figure panel in the previous Figure 6A. As the reviewer says, the effect of bafilomycin should be expected to be long-lasting, so the endocytosed vesicles cannot be re-acidified. Now, in new Figure 8A, we have changed the panel explaining the experimental situation of vesicles in the presence of bafilomycin. Another insightful point kindly raised by the reviewer, regarding the quick recruitment of newly endocytosed vesicles to release sites, is highly relevant to the interpretation of our data, but is a different issue from the situation explained in new Figure 8A. To avoid confusion, the arrow drawn in the previous version indicating movement of endocytosed vesicles back to the docked state has been omitted in the new panel, and this critical issue is now carefully discussed in terms of the mechanism of GPR55 action on the release machinery (p15, lines 480-482).

      The saturation of the train-induced signals is interpreted as reflecting an exhaustion of SVs initially close to VGCCs or more generally, susceptible to being released following VGCC activation.

      In an alternative scenario, saturation occurs because AP trains, or KCl applications, become unable to activate VGCCs. This could occur either because long illumination causes photodamage of VGCCs, or because repeated activation of VGCCs leads to their inactivation. The latter explanation is possible in spite of a publication from the authors' laboratory describing the facilitation of presynaptic VGCCs following paired stimulations in this synapse (Diaz-Rojas et al., 2015).

      We agree that it is an important control experiment to demonstrate that the Ca<sup>2+</sup> increase upon repetitive AP trains is intact even during or after the long photo-illumination for imaging. To test this possibility, we have performed additional fluorescent Ca<sup>2+</sup> imaging at PC varicosities during individual 400-AP trains and also in response to 50 mM KCl following the series of AP trains. The new data demonstrate that Ca<sup>2+</sup> influx remains constant across all AP trains (shown in Figure 8—figure supplement 1), arguing against VGCC inactivation or photodamage as a major factor underlying the saturated signal increase in the synapto-pHluorin. We have added an explanation regarding this issue in the text p11, lines 327-329.

      The authors explain the larger effect of ionomycin compared with AP trains and KCl applications as reflecting a better capacity to increase the bulk calcium concentration. The above proposal for the inactivation of VGCCs offers an alternative explanation, in my view more likely.

      As noted above, our newly added Ca<sup>2+</sup> imaging data clearly showed that individual AP trains induced similar Ca<sup>2+</sup> influxes during repetitive trials, in line with our original interpretation. In addition, the Ca<sup>2+</sup> increase by KCl was shown to be more potent and broader in axon terminals and trunks. Nevertheless, the exocytic signal caused by ionomycin was clearly larger, implying a critical effect of the source of Ca<sup>2+</sup> influx in PC boutons. Therefore, we suppose that the marked effect of ionomycin on release reflects a higher elevation of bulk Ca<sup>2+</sup> in the cytoplasm arising from the non-site-selective Ca<sup>2+</sup> ionophore (Figure 8—figure supplement 1, p11, lines 327-334; lines 342-349).

      In yet another scenario, recycled SVs in bafilomycin retain their fluorescence since they do not reacidify, but they come back to release sites to undergo new rounds of exocytosis. The new exocytosis events do not increase the fluorescence since the pH in the vicinity of synapto-pHluorin does not change. NH4Cl would then increase the fluorescence by revealing SVs that had not undergone exocytosis-endocytosis cycles during AP trains or KCl exposure. In this last scenario, the GPR55-sensitive SV pool would be a specific sub-pool of SVs that can be recycled by repetitive 400 AP trains.

      We deeply appreciate the reviewer pointing out this important possibility. We completely agree that this scenario can also explain the pool which is sensitive to GPR55. Therefore, we have added an explanation of this possibility in the text (p15, lines 474–482).

      Figure 6F shows calcium imaging measurements of PC varicosities. Unfortunately, crucial measurements are missing. It would have been revealing to compare calcium rises for the first and the last of the 8 400-AP trains. And to compare calcium rises elicited by 60 mM KCl before and after the series of 8 400-AP trains.

      This is an important control experiment. Therefore, we have performed additional Ca<sup>2+</sup> imaging during the eight 400-AP trains and KCl application. The new results shown in the present Figure 8—figure supplement 1 clearly suggest that Ca<sup>2+</sup> rises are comparable between the first and eighth trains, and that additional Ca<sup>2+</sup> influx (which was large in amplitude and wide in area) could still be evoked by KCl after the eight trains. The experiments are explained in the text p11, lines 327-336.

      Minor points:

      (1) Introduction: The Introduction would benefit from a more substantial description of what is known about GPR55 and downstream signaling pathways. Right now, it is stated that GPR55 is 'potentially expressed in PCs': What are the arguments behind this statement? Also, the signaling pathway is discussed on p.12, much too late in the ms. Why not move this section to the Introduction?

      We thank the reviewer for the helpful suggestion. As recommended, in the revised manuscript, we have changed the Introduction by moving sentences from other sections, including speculation about the expression of GPR55 in Purkinje cells (Ryberg et al., 2007; Wu et al., 2013) (p3-4, lines 71-75) and downstream signaling pathways (Gα<sub>q</sub>/PLC/IP<sub>3</sub>/Ca<sup>2+</sup> and Gα<sub>13</sub>/RhoA/ROCK) (p3, lines 63-68).

      (2) Legend to Figures 1, 2, and 4: What is the EGTA concentration in these experiments?

      As suggested, the EGTA concentrations (0.5 or 5 mM) used in the individual experiments have now been clearly indicated both in the figure legends and in the Methods section (p18, lines 585-586).

      (3) Fig. 3C: These experiments show that some SV pool is depleted by AM251. The authors state that this is the RRP, but other options are possible. In the calyx of Held, similar experiments are supposed to deplete not only the FRP (=RRP, presumably) but also the SRP.

      We thank the reviewer for pointing out this important aspect of the categorization of vesicle pools. In PC boutons, the membrane capacitance increases in response to depolarization pulses of different durations in a manner fitted by a single exponential curve (see Figure 5C for example). Our previous study (Kawaguchi and Sakaba, 2015) noted that the vesicle pools corresponding to FRP and SRP may not be easy to distinguish in PCs, suggesting an apparently single component. That’s the reason why we simply describe the component as RRP in the present manuscript. Still, as suggested, careful discussion of the typical fast and slow components would be helpful for interpreting our present findings. Therefore, in the revised manuscript, we have added a sentence to explain this issue (p7, lines 211-214).

      (4) p. 8: When the 400 APs protocol is introduced, the corresponding frequency (20 Hz?) should be mentioned. This information comes only much later in the ms.

      We are sorry for our insufficient explanation in the previous manuscript. As suggested, we have clearly written the stimulation frequency ‘20 Hz’ in the main text where the 400 APs protocol first appears (p9, lines 277-278).

      (5) Figure 5, panels B and F: synapto-pHluorin is labelled twice 'synapto-pHluolin'.

      Sorry for the careless typos. They are now corrected (new Figure 7).

      (6) Legend to Figure 5, last line: 'x' is missing in the last equation.

      Thank you for the careful and kind check. Now, ‘x’ has been added to the last equation in the legend for new Figure 7.

      (7) p. 7, Interpretation of EGTA effects: The authors frame their interpretation of EGTA effects around the distance between release sites and VGCCs. However, since AM251 appears to alter the recruitment of SVs, a more parsimonious interpretation would be that EGTA modifies the calcium-dependent movement of SVs towards release sites.

      Thank you for suggesting an insightful scenario. We agree that the capacitance jump upon a long depolarization pulse would include exocytosis of a substantial number of vesicles that are newly recruited during the Ca<sup>2+</sup> increase. Then, as the reviewer states, EGTA possibly lowers the Ca<sup>2+</sup>-dependent replenishment of synaptic vesicles, and this replenishment system might be the target of GPR55 activation. Therefore, we have now clearly added an explanation about this possibility in the text (p15, lines 474-482).

      (8) p. 13, Interpretation of GPR55 sensitive SV pool: The authors suggest a larger distance to VGCCs for this pool compared to naïve SVs. An alternative could be that in the presence of GPR55, the recruitment to release sites would be less efficient.

      This is also an insightful suggestion regarding the causal relationship between the GPR55-mediated reduction of vesicular release and the vesicle pools. Accordingly, we have revised the Discussion (see “Dynamics of synaptic vesicles among distinct functional pools”) by clearly describing the possibility of decreased recruitment of vesicles to release sites after GPR55 activation (p15, lines 474-482). Having considered all the suggested scenarios together, we believe that the possible mechanisms for GPR55-mediated reduction of release are much more clearly explained in the revised manuscript.

      Reviewer #3 (Recommendations for the authors):

      (1) The time course of the agonist-induced suppression should be reported (Figure 1).

      This is an important point for presenting the data clearly, as also suggested by Reviewer #1. Accordingly, we have changed the figure panels to show time courses of agonist-induced suppression (shown in new Figures 1 and 4).

      (2) Show that the suppression of GABAergic transmission mediated by AM251 and LPI is eliminated in GPR55 KO mice.

      We thank the reviewer for encouraging us to try this important experiment. Following the suggestion, we attempted to knock down GPR55 expression using CRISPR/Cas9 in cultured Purkinje cells. To avoid potential developmental compensation, we adopted the CRISPR/Cas9-based genome editing approach rather than using global knockout mice. Those GPR55-KO cells, as noted above in response to comment #2 of Reviewer #1, showed decreased labeling of PC axon terminals by the fluorescent derivative of AM251 (shown in new Figure 2) and abolishment of AM251-mediated suppression of vesicle exocytosis (Figure 3D and E). These results are explained in the text p5-6, lines 141-159; p6, lines 173-178.

      (3) Include references supporting AM251 and LPI as GPR55 agonists and specify the EC<sub>50</sub> concentrations for each agonist. Furthermore, provide details about the GPR55 antagonist CID16020046.

      As suggested, we have added references regarding the GPR55 agonists AM251 and LPI. In the text, the following information was added: AM251, originally characterized as an inverse agonist for CB1, has also been reported to act as a GPR55 agonist (Ryberg et al., 2007; Henstridge et al., 2009) (p5, lines 115-116). LPI is an established endogenous GPR55 agonist (Oka et al., 2007; Henstridge et al., 2009) (p5, lines 127-129). The reported EC<sub>50</sub> values are ~30 nM for LPI (Oka et al., 2007, HEK cell assay) and 39 nM for AM251 (Ryberg et al., 2007, HEK cell assay) (p4, lines 94-95; p5, lines 127-129). Regarding the GPR55 antagonist CID16020046, detailed information (IC<sub>50</sub> = 0.21 µM for GPR55 without significant effect on CB1 receptor) was added in the text with an appropriate citation (Kargl et al., 2013) (p5, lines 123-127). These points have also been added to the Methods section (p17, lines 587-589).

      (4) Regarding the onset delay (Figure 4C; page 8, lines 3-4), consider the following: "AM251 induced a modest yet significant synaptic delay, estimated by the time to the onset of release" (or something similar).

      We thank the reviewer for suggesting this helpful wording. Accordingly, we have changed the sentence to explain the delayed onset (p9, lines 264-265).

      These three points should be properly acknowledged in the Discussion:

      (1) Are action potentials (APs)/depolarizations and ionomycin applications comparable? Ionomycin mediates a large calcium rise significantly slower than the calcium rise mediated by fast depolarization. Such presynaptic calcium dynamics could account, in part, for the different results.

      The qualitative difference between the Ca<sup>2+</sup> increases mediated by APs/depolarization and by ionomycin is an important point. Thank you for pointing out this issue. In the revised manuscript, we have added an explanation of the possible difference arising from the distinct dynamics of Ca<sup>2+</sup> increases caused by direct depolarization of axon terminals or by ionomycin (p14, lines 452-453).

      (2) Previous studies on hippocampal CA3-CA1 pyramidal cell synapses indicate that GPR55 activation enhances glutamate release through presynaptic calcium modulation while diminishing inhibitory postsynaptic strength by reducing GABA<sub>A</sub> receptors (Sylantyev et al., PNAS 2013; Rosenberg et al., Neuron 2023). In contrast, Inoshita and Kawaguchi discovered that GPR55 suppresses PC-DCN inhibitory transmission by decreasing GABA release without affecting inhibitory postsynaptic strength. Some potential explanation for this discrepancy is warranted.

      We thank the reviewer for pointing out this important issue, and are sorry for not providing an appropriate discussion of the possible interpretation in the previous manuscript. In the revised manuscript, we have added explanations for this discrepancy. First, GABA release at PC terminals is only weakly influenced by elevation of cytoplasmic Ca<sup>2+</sup> via ER stores (Gomez et al., 2020), probably due to abundant calbindin. Second, our present data clearly show GPR55 signals at PC terminals (although indirect, see Figure 2), while hippocampal inhibitory neuronal boutons showed lower GPR55 levels compared with excitatory neuronal boutons (Rosenberg et al., Neuron, 2023). Third, the subtypes and/or anchoring mechanisms of postsynaptic GABA<sub>A</sub> receptors might differ between the two distinct postsynaptic neurons in the hippocampus and the cerebellum. These factors are now clearly discussed in the text (p12, lines 380-396).

      (3) Earlier work has suggested that CB1 receptor activation can alter the release machinery. Therefore, the observation that GPR55 activation induces changes in the RRP is not entirely surprising.

      As pointed out, previous studies showed that CB1R influences the synaptic release machinery, rather than Ca<sup>2+</sup> influx (Ramirez-Franco et al., 2014). In that context, as the reviewer says, the GPR55-mediated RRP change can be regarded as a synaptic modulation mechanism similar to the CB1-mediated one. However, considering the different downstream signaling pathways (G<sub>12/13</sub>- or G<sub>q</sub>-mediated for GPR55 versus G<sub>i/o</sub>-mediated for CB1R), our findings would provide an important perspective on the regulation mechanisms of the release machinery, which should be further analyzed in future studies. We have now added these points to the Discussion (p13-14, lines 435-439).

      (4) Add a section about the limitations of this study (see Weaknesses above).

      As suggested, we have added a section about the current limitations of this study, which we could not address in the revision and which should be addressed in the future (p15, lines 488-508). In particular, the actual endogenous agonist that activates GPR55 and the physiological situation in which it is produced, the need for more direct evidence of GPR55 presence at PC boutons, and the downstream mechanisms of GPR55-mediated suppression of GABA release are now clearly noted in that section.

      (5) Double-check grammar and typos ("anandamid").

      We are really sorry for the poor writing in the previous manuscript. We have now carefully checked the text.

    1. eLife Assessment

      This important research investigates the precision of numerosity perception in two types of tasks and concludes that human performance aligns with an efficient coding model optimized for current environmental statistics and task goals. The proposed model receives compelling evidence from two numerosity perception experiments and a reanalysis of an existing dataset of risky decision-making. These findings have theoretical implications for our understanding of numerosity perception and decision-making as well as the ongoing debate on different efficient coding models.

    2. Reviewer #1 (Public review):

      Summary:

      The "number sense" refers to an imprecise and noisy representation of number. Many researchers propose that the number sense confers a fixed (exogenous) subjective representation of number that adheres to scalar variability, whereby the variance of the representation of number is linear in the number.

      This manuscript investigates whether the representation of number is fixed, as usually assumed in the literature, or whether it is endogenous. The two dimensions on which the authors investigate this endogeneity are the subject's prior beliefs about stimuli values and the task objective. Using two experimental tasks, the authors collect data that are shown to violate scalar variability and are instead consistent with a model of optimal encoding and decoding, where the encoding phase depends endogenously on prior and task objectives. I believe the paper asks a critically important question. The literature in cognitive science, psychology, and increasingly in economics, has provided growing empirical evidence of decision-making consistent with efficient coding. However, the precise model mechanics can differ substantially across studies. This point was made forcefully in a paper by Ma and Woodford (2020, Behavioral & Brain Sciences), who argue that different researchers make different assumptions about the objective function and resource constraints across efficient coding models, leading to a proliferation of different models with ad-hoc assumptions. Thus, the possibility that optimal coding depends endogenously on the prior and the objective of the task opens the door to a more parsimonious framework in which assumptions of the model can be constrained by environmental features. Along these lines, one of the authors' conclusions is that the degree of variability in subjective responses increases sublinearly in the width of the prior. Importantly, the degree of this sublinearity differs across the two tasks, in a manner that is consistent with a unified efficient coding model.

      Comments on revisions:

      The authors have done an excellent job addressing my main concerns from the previous round. The new analyses that address the alternative model of "no cognitive noise and only motor noise" are compelling and provide quantitative evidence that bolsters the paper's overall contribution. The authors also went above and beyond by reanalyzing the Frydman and Jin (2022) dataset to provide new and very interesting analyses that offer an additional out-of-sample test of the model proposed in the current paper.

    3. Reviewer #2 (Public review):

      Summary:

      This paper provides an ingenious experimental test of an efficient coding objective based on optimization as a task success. The key idea is that different tasks (estimation vs discrimination) will, under the proposed model, lead to a different scaling between the encoding precision and the width of the prior distribution. Empirical evidence in two tasks involving number perception supports this idea.

      Strengths:

      - The paper provides an elegant test of a prediction made by a certain class of efficient coding models previously investigated theoretically by the authors.

      - The results in experiments and modeling suggest that competing efficient coding models, optimizing mutual information alone, may be incomplete by missing the role of the task.

      - The paper carefully considers how the novel predictions of the model interact with the Weber/Fechner law.

      Weaknesses:

      - The claims would be even more strongly validated if data were present at more than two widths in the discrimination experiment (also noted in Discussion).

    4. Reviewer #3 (Public review):

      Summary:

      This work investigates whether human imprecision in numeric perception is a fixed structural constraint or an endogenous property that adapts to environmental statistics and task objectives. By measuring behavioral variability across different uniform prior distributions in both estimation and discrimination tasks, the authors show that perceptual imprecision increases sublinearly with prior width. They demonstrate that the specific exponents of this scaling (1/2 for estimation and 3/4 for discrimination) can be derived from an efficient-coding model, wherein decision-makers optimally balance task-specific expected rewards against the metabolic costs of neural coding. The revised manuscript expands this framework to accommodate logarithmic representations and validates the core model against an independent dataset of risky choices.

      Strengths:

      The authors have effectively addressed my previous concerns with rigorous additions:

      (1) The mathematical formulation has been revised into a discrete signal accumulation framework, making the objective function and resource trade-offs much more transparent and mathematically tractable.

      (2) The incorporation of the logarithmic representation resolves prior ambiguities regarding structural constraints.

      (3) The new split-half analysis effectively addresses the temporal dynamics of adaptation. The stability of the sublinear scaling across the experiment provides solid evidence that human subjects utilize rapid, top-down modulation to adjust their encoding strategy when explicitly informed about the environment.

      (4) Validating the derived scaling exponents on an independent risky-choice dataset robustly supports the generalizability of the theoretical framework beyond a single cognitive domain.

      Weaknesses:

      The methodological and theoretical issues raised in the first round have been thoroughly resolved, and the evidence supporting the claims regarding response variance is convincing.

      There is one remaining theoretical point that warrants discussion to provide a complete picture of the proposed generative model. The manuscript exquisitely models and predicts response variance (imprecision), but it remains largely silent on the closed-form predictions for the mean estimation (i.e., bias). Under the assumption of optimal Bayesian decoding combined with specific encoding schemes (e.g., linear vs. logarithmic), the model implicitly generates mathematical predictions for the subjects' mean estimates. Specifically, varying the scaling exponent (α) and the prior width (w) should systematically alter the predicted bias in different conditions.

      While fitting or explicitly explaining this mean bias is not strictly necessary for the core claims regarding variance scaling, acknowledging what the optimal decoder analytically predicts for the mean estimation, and how it aligns or contrasts with typical empirical observations, would strengthen the theoretical transparency of the paper.

    5. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      The "number sense" refers to an imprecise and noisy representation of number. Many researchers propose that the number sense confers a fixed (exogenous) subjective representation of number that adheres to scalar variability, whereby the variance of the representation of number is linear in the number.

      This manuscript investigates whether the representation of number is fixed, as usually assumed in the literature, or whether it is endogenous. The two dimensions on which the authors investigate this endogeneity are the subject's prior beliefs about stimuli values and the task objective. Using two experimental tasks, the authors collect data that are shown to violate scalar variability and are instead consistent with a model of optimal encoding and decoding, where the encoding phase depends endogenously on prior and task objectives. I believe the paper asks a critically important question. The literature in cognitive science, psychology, and increasingly in economics, has provided growing empirical evidence of decision-making consistent with efficient coding. However, the precise model mechanics can differ substantially across studies. This point was made forcefully in a paper by Ma and Woodford (2020, Behavioral & Brain Sciences), who argue that different researchers make different assumptions about the objective function and resource constraints across efficient coding models, leading to a proliferation of different models with ad-hoc assumptions. Thus, the possibility that optimal coding depends endogenously on the prior and the objective of the task, opens the door to a more parsimonious framework in which assumptions of the model can be constrained by environmental features. Along these lines, one of the authors' conclusions is that the degree of variability in subjective responses increases sublinearly in the width of the prior. And importantly, the degree of this sublinearity differs across the two tasks, in a manner that is consistent with a unified efficient coding model.

      We thank Reviewer #1 for her/his comments and for placing our work in a broader context.

      Comments:

      (1) Modeling and implementation of estimation task

      The biggest concern I have with the paper is about the experimental implementation and theoretical account of the estimation task. The salient features of the experimental data (Figure 1C) are that the standard deviations of subjects' estimated quantities are hump-shaped in the true stimulus x and that the standard deviation, conditional on the true stimulus x, is increasing in prior width. The authors attribute these features to a Bayesian encoding and decoding model in which the internal representation of the quantity is noisy, and the degree of noise depends on the prior - as in models of efficient coding (Wei and Stocker 2015 Nature Neuro; Bhui and Gershman 2018 Psych Review; Hahn and Wei 2024 Nature Neuro).

      The concern I have is about the final "step" in the model, where the authors assume there is an additional layer of motor noise in selecting the response. The authors posit that the subject's selection of the response is drawn from a Gaussian with a mean set to the optimally decoded estimate x*(r), and variance set to a free parameter sigma_0^2. However, the authors also assume that the Gaussian distribution is "truncated to the prior range." This truncation is a nontrivial assumption, and I believe that on its own, it can explain many features of the data.

      To see this, assume that there is no noise in the internal representation of x, there is only motor noise. This corresponds to a special case of the authors' model in which υ is set to 0. The model then reduces to a simple account in which responses are drawn from a Gaussian distribution centered at the true value of x, but with asymmetric noise due to the truncation. I simulated such a model with sigma_0=7. The resulting standard deviations of responses for each value of x (based on 1000 draws for each value of x), across the three different priors, reproduce the salient patterns of the standard deviation in Figure 1C: i) within each condition, the standard deviation is hump-shaped and peaks at x=60 and ii) conditional on x, standard deviation increases in prior width. The takeaway is that this simple model with only truncated motor noise - and without any noisy or efficient coding of internal representations - provides an alternative channel through which the prior affects behavior.
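      The motor-noise-only simulation described above can be sketched as follows. This is a minimal illustration, not the reviewer's or the authors' code: the function names are hypothetical, and the three prior ranges are assumed to be centered on 60 with widths 20, 40, and 60. Responses are drawn from a Gaussian centered on the true value x with sigma_0 = 7, truncated to the prior range by rejection sampling, with 1000 draws per value of x.

```python
import numpy as np

# Assumed prior ranges (not stated explicitly in the text): widths 20, 40, 60, centered on 60.
PRIORS = {"Narrow": (50, 70), "Medium": (40, 80), "Wide": (30, 90)}


def truncated_normal(rng, mean, sd, low, high, size):
    """Draw `size` samples from N(mean, sd^2) restricted to [low, high] (rejection sampling)."""
    out = np.empty(size)
    filled = 0
    while filled < size:
        draws = rng.normal(mean, sd, size=2 * (size - filled))
        kept = draws[(draws >= low) & (draws <= high)]
        take = min(kept.size, size - filled)
        out[filled:filled + take] = kept[:take]
        filled += take
    return out


def sd_profile(low, high, sigma0=7.0, n_draws=1000, seed=0):
    """Standard deviation of simulated responses for each integer stimulus x in [low, high]."""
    rng = np.random.default_rng(seed)
    xs = np.arange(low, high + 1)
    sds = np.array(
        [truncated_normal(rng, x, sigma0, low, high, n_draws).std() for x in xs]
    )
    return xs, sds


if __name__ == "__main__":
    for name, (low, high) in PRIORS.items():
        xs, sds = sd_profile(low, high)
        print(f"{name}: SD at x=60 is {sds[xs == 60][0]:.2f}, SD at lower bound is {sds[0]:.2f}")
```

      Under these assumptions the standard deviation is largest at the center of each range and shrinks toward the bounds, and at a given x it grows with the prior width, which is the qualitative pattern the reviewer describes.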

      Of course, this does not imply that subjects' coding is not described by the efficient encoding and decoding model posited by the authors. However, it does suggest an important alternative mechanism for the authors' theoretical results in the estimation task. Moreover, some of the quantitative conclusions about the differences in behavior with the discrimination task would be greatly affected by the assumption of truncated motor noise.

      Turning to the experiment, a basic question is whether such a truncation was actually implemented in the design. That is, was the range of the slider bar set to the range of the prior? (The methods section states that the size on the screen of the slider was proportional to the prior width, but it was unclear whether the bounds of the slider bar changed with the prior). If the slider bar range did depend on the prior, then it becomes difficult to interpret the data. If not, then perhaps one can perform analyses to understand how much the motor noise is responsible for the dependence of the standard deviation on both x and the prior width. Indeed, the authors emphasize that their model is best fit at α=0.48, which would seem to imply that the best fitting value of υ is strictly positive. However, it would be important to clarify whether the estimation procedure allowed for υ=0, or whether this noise parameter was constrained to be positive (i.e., clarify whether the estimation assumed noisy and efficient coding of internal representations).

      We thank Reviewer #1 for her/his close attention to the motor-noise component of our model, in particular its truncation at the border of the prior. We agree that the truncated motor noise should be examined more closely as it affects the variance of responses. We address here the questions raised by the reviewer, and we detail the new analyses we have conducted.

      First, regarding the experimental paradigm, we note that this truncation was indeed implemented in the design, i.e., the range of the slider bar corresponded to the range of the prior (we now indicate this more clearly in the manuscript). Subjects thus were not able to select an estimate that was not in the support of the prior, and it is precisely for this reason that we model the selection step with a truncated distribution, so that the model is consistent with the experimental setup. This truncation naturally decreases the response variability near the bounds, and this may affect the overall variability differently for the different priors, as noted by the reviewer in her/his simulations. We have conducted a series of analyses to investigate this question.

      First, we consider a model in which there is no cognitive noise, but only motor noise. To answer one of the reviewer’s questions, the model-fitting procedure did allow for a vanishing cognitive noise (𝜈 = 0), i.e., it allowed for such a “motor-noise-only” mechanism to be the main account of the data. This value (𝜈 = 0), however, does not maximize the likelihood of the model, and thus this hypothesis is not the best account of the data. Nevertheless, we fit a model that enforces the absence of cognitive noise (i.e., with 𝜈 = 0). The BIC of this “motor-noise-only” model is higher than that of our best-fitting model by more than 1100, indicating very strong support for the best-fitting model, which features a positive cognitive noise (𝜈 > 0), and 𝛼 = 1/2, as in our theoretical proposal.

      Furthermore, the standard deviation of responses predicted by the motor-noise-only model overestimates substantially the variability of subjects' responses in the Narrow and Medium conditions (Figure 4, panel b), while the predictions of the best-fitting model are much closer to the behavioral data (panel a). Finally, the variances predicted by this model do not increase linearly with the prior width (contrary to the behavioral data). Instead, the variance increases more between the Narrow and the Medium priors than between the Medium and the Wide priors, as the effects of the bounds attenuate with the wider prior (panel c, solid green line).

      To further this analysis we fit in addition a model with no cognitive noise (𝜈 = 0), but in which we now allow the degree of motor noise, 𝜎<sub>0</sub>, to depend on the prior. Our reasoning is that if the truncated motor noise were the sole explanation for the increase in subjects' variance with the prior width, then we would expect the noise levels for the three priors to be roughly equal. We find instead that they are different (with values of 5.9, 8.3, and 9.8, for the prior widths 20, 40, and 60, respectively, when pooling subjects; and when fitting subjects individually the distributions of parameter values exhibit a clear increase; see panels c and d above). This model moreover yields a BIC higher by more than 590 than our best-fitting model. We note in addition that these parameter values differ in such a way that they result in response variances that are a linear function of the prior width, as found in the behavioral data, although they overestimate the subjects' variances (panel c, dotted green line). This linear increase is directly predicted by our best-fitting model, which has one less parameter (2 vs. 3), and which moreover accurately predicts the variability of subjects across priors (panel c, pink line). Hence the data do not support a model with no cognitive noise and with only a constant, truncated motor noise.

      We also consider another possibility, that in addition to truncated motor noise there is in fact a degree of cognitive noise, but one that is insensitive to the width of the prior. In other words, there is cognitive imprecision, but it does not efficiently adapt to the prior range, as in our proposal. This corresponds to setting 𝛼 = 0, in our model; but this specification of the model results in a poor fit, with a BIC higher by more than 300 than that of the best-fitting model, whose cognitive noise scales with the exponent 𝛼 = 1/2, consistent with our theory. Thus our data do not support the hypothesis of a cognitive noise that does not scale with the prior range; instead, subjects' responses support a model in which the variance of the cognitive noise increases linearly with the prior range.

      We note in addition that there is inter-subject variability: different subjects have different degrees of imprecision. But if the source of the imprecision was the truncated motor noise, then different degrees of truncated noise should result in different relationships between the behavioral variance and the prior widths: subjects with smaller noise should be relatively insensitive to the width of the prior, while subjects with greater noise should be more sensitive. In that case, when fitting the subjects with the model in which the imprecision scales as a power of the width, we should expect subjects to exhibit a diversity of best-fitting parameter values 𝛼. Instead, as noted, we find that the data is best captured by a single exponent 𝛼 = 1/2, equal for all the subjects. This suggests that although the “baseline level” of the imprecision may differ per subject, the way that their imprecision increases as a function of the prior width is the same for all the subjects, a behavior that is not explained by truncated noise alone.

      Furthermore, Prat-Carrabin, Harl, and Gershman 2025 present behavioral results obtained in a similar numerosity-estimation task, with the same prior ranges, but with the experimental difference that the slider was not limited to the range of the current prior: instead it had the same width in all three conditions, and covered in all trials a range wider than that of the Wide prior (from 25 to 95). The behavioral variance observed in this study increases linearly with the prior range, as in our results. Thus we conclude that the linear increase in subjects' variability does not originate in the bounds of the experimental slider.

      Finally, Prat-Carrabin et al. 2025 presents an fMRI study involving a similar numerosity-estimation experiment. This study shows that numerosity-sensitive neural populations in human parietal cortex adapt their tuning properties to the current numerical range, resulting in less precise neural encoding when the range is wider. This substantiates the notion that the degree of imprecision in cognitive noise adapts to the prior range, as in our proposal.

      Overall, we conclude that the linear increase of behavioral variability that we document originates in the endogenous adaptation, across conditions, of the amount of imprecision in the internal encoding of numerosities.

      We now include these analyses in a new section of the Methods (p. 24-27), which we summarize in the main text (p. 7-8). The Figure above is now included (as Figure 4). We also now cite the references mentioned by Reviewer #1 and which we had not already cited (Bhui and Gershman 2018 Psych Review; Hahn and Wei 2024 Nature Neuro).

      References:

      Prat-Carrabin, A., Harl, M. V., & Gershman, S. J. (2025). Fast efficient coding and sensory adaptation in gain-adaptive recurrent networks (p. 2025.07.11.664261). bioRxiv. https://doi.org/10.1101/2025.07.11.664261

      Prat-Carrabin, A., de Hollander, G., Bedi, S., Gershman, S. J., & Ruff, C. C. (2025). Distributed range adaptation in human parietal encoding of numbers (p. 2025.09.25.675916). bioRxiv. https://doi.org/10.1101/2025.09.25.675916

      (2) Differences across tasks

      A main takeaway from the paper is that optimal coding depends on the expected reward function in each task. This is the explanation for why the degree of sublinearity between standard deviation and prior width changes across the estimation and discrimination task. But besides the two different reward functions, there are also other differences across the two tasks. For example, the estimation task involves a single array of dots, whereas the discrimination task involves a pair of sequences of Arabic numerals. Related to the discussion above, in the estimation task the response scale is continuous whereas in the discrimination task, responses are binary. Is it possible that these other differences in the task could contribute to the observed different degrees of sublinearity? It is likely beyond the scope of the paper to incorporate these differences into the model, but such differences across the two tasks should be discussed as potential drivers of differences in observed behavior.

      If it becomes too difficult to interpret the data from the estimation task due to the slider bar varying with the prior range, then which of the paper's conclusions would still follow when restricting the analysis to the discrimination task?

      There are indeed several differences between the estimation and discrimination tasks that could, in principle, contribute to the quantitative differences observed between them. The fact that the estimation task requires a continuous numerical report whereas the discrimination task involves a binary choice is captured in our model by incorporating distinct loss functions for the two tasks (Eq. 4). This distinction is a key element of the theoretical framework, as it determines the optimal allocation of representational precision. We agree with Reviewer #1 that another important difference is that the estimation task involves non-symbolic dot arrays while the discrimination task uses short sequences of Arabic numerals, which could also affect performance through distinct perceptual or cognitive processes. Although we cannot exclude this possibility, it is unclear why such a difference in stimulus format would produce the specific quantitative patterns that we observe — and that are predicted by our proposal, namely, the sublinear scalings with task-dependent exponents. Each experiment, taken independently, supports the model's central prediction that the precision of internal representations scales sublinearly with the width of the prior distribution. Taken together, the two tasks show that this dependence itself varies with the observer's objective, confirming that perceptual precision is endogenously determined by both the statistical context and the task goal.

      We agree with Reviewer #1 that this point should be mentioned; we now do so in the Discussion (p. 17-18).

      (3) Placement literature

      One closely related experiment to the discrimination task in the current paper can be found in Frydman and Jin (2022 Quarterly Journal of Economics). Those authors also experimentally vary the width of a uniform prior in a discrimination task using Arabic numerals, in order to test principles of efficient coding. Consistent with the current findings, Frydman and Jin find that subjects exhibit greater precision when making judgments about numbers drawn from a narrower distribution. However, what the current manuscript does is it goes beyond Frydman and Jin by modeling and experimentally varying task objectives to understand and test the effects on optimal coding. This contribution should be highlighted and contrasted against the earlier experimental work of Frydman and Jin to better articulate the novelty of the current manuscript.

      We thank Reviewer #1 and we agree that the work of Frydman and Jin is highly relevant to our study. Instead of comparing our contributions to theirs, we have decided to have a close look at their data, in light of our theoretical proposal. This enables us to test the predictions of our theory against human choices made in a rather different decision situation than that of our discrimination task.

      Thus we looked, in their data, at the participants' probability of choosing the risky lottery instead of the certain amount, as a function of the difference between the lottery's expected value (pX) and the certain amount (C; we also added a small bias term to the certain option; such bias was not necessary with our discrimination data, presumably because of the inherent symmetry of our task).

      We find, as did Frydman and Jin, and similarly to our discrimination task, that the participants are more precise when the proposed amounts are sampled from a Narrow prior, in comparison to a Wide prior (see figure above, first panel). But we also find, as in our discrimination task, that when normalizing the value difference by the prior width participants are more sensitive to this normalized difference in the Wide condition than in the Narrow one, suggesting that their imprecision scales across conditions by a smaller factor than the prior width (last panel). And we find, consistent with our discrimination data and with our theory, that choice probabilities in the two conditions match very well when normalizing the difference by the prior width raised to the exponent 3/4 (third panel).

      Model fitting supports this observation. We fit the data to our model (described by Eq. 3), with the addition of a lapse probability and of a bias, and with different values of the exponent 𝛼. The best-fitting model is the one with 𝛼 = 3/4. Its BIC (35,419) is lower than those of the models with 𝛼 = 1, 1/2, and 0 (by 142, 39, and 514, respectively). It is also lower by 2.14 than that of a model in which 𝛼 is left as a free parameter (in which case the best-fitting 𝛼 is 0.68, a value not far from 3/4). We emphasize that these BIC values indicate that the hypotheses 𝛼 = 0 and 𝛼 = 1 are clearly rejected, i.e., the participants' imprecision increases with the prior width (𝛼 > 0), but sublinearly (𝛼 < 1). In other words, the responses collected by Frydman and Jin in a risky-choice task are quantitatively consistent with our results obtained in a number-discrimination task, and they further substantiate our model of endogenous precision.
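      For readers less familiar with this kind of comparison, the sketch below shows how a BIC is computed and how the reported gaps translate into BIC values for each model. It is only an illustration: the function name is hypothetical, the standard definition BIC = k ln(n) − 2 ln(L) is assumed, and the log-likelihoods in the last lines are made-up numbers; only the value 35,419 and the four gaps come from the text above.

```python
import math


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion; lower values indicate a better model."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood


# Values reported in the text: BIC of the best-fitting model (alpha = 3/4)
# and how much higher the BIC of each alternative is.
BEST_BIC = 35419.0
REPORTED_GAPS = {"alpha = 1": 142.0, "alpha = 1/2": 39.0, "alpha = 0": 514.0, "alpha free": 2.14}

implied_bics = {name: BEST_BIC + gap for name, gap in REPORTED_GAPS.items()}

if __name__ == "__main__":
    for name, value in implied_bics.items():
        print(f"{name}: BIC = {value:.2f}")

    # Hypothetical illustration: an extra free parameter must improve the
    # log-likelihood enough to offset the ln(n) penalty.
    fixed = bic(log_likelihood=-101.0, n_params=2, n_obs=1000)
    free = bic(log_likelihood=-100.0, n_params=3, n_obs=1000)
    print("fixed-exponent model preferred:", fixed < free)
```

      The last comparison illustrates why a model with a free exponent can have a slightly higher BIC than the model with the exponent fixed at 3/4 even when it fits marginally better: the gain in likelihood does not offset the penalty for the additional parameter.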

      We moreover note that their proposed model is similar to ours, in that the decision-maker is allowed to optimize a noisy encoding scheme to the prior, subject to a ‘capacity constraint’ on the number 𝑛 of encoding signals that can be obtained. Crucially, this capacity constraint is assumed to be a property of the decision-maker that does not change across priors, and thus 𝑛 is fixed across prior widths. Therefore, their model predicts that the participants' imprecision should scale linearly with the prior width (this is also what we obtain in our model if we don’t optimize a similar parameter; see the revised presentation of the model on p. 12-13). We note that when they fit this parameter, 𝑛, separately across conditions, they find that it is larger with the wider prior. This is precisely what our model of endogenous precision predicts. In turn this predicts a sublinear scaling of the imprecision, instead of the linear one that would result from a fixed 𝑛, and indeed we find a sublinear scaling in both their dataset and ours. What is more, in both datasets the sublinear scaling is best captured by the exponent 𝛼 = 3/4, as we predict.

      This analysis of another independent dataset obtained with a different experimental paradigm significantly strengthens our conclusions. Thus we added to the Results section a new subsection discussing this analysis, and the figure above now appears as Figure 3. We also mention it in the Introduction (l. 87-89) and in the Discussion (l. 556-557).

      Reviewer #2 (Public review):

      Summary:

      This paper provides an ingenious experimental test of an efficient coding objective based on optimization as a task success. The key idea is that different tasks (estimation vs discrimination) will, under the proposed model, lead to a different scaling between the encoding precision and the width of the prior distribution. Empirical evidence in two tasks involving number perception supports this idea.

      Strengths:

      The paper provides an elegant test of a prediction made by a certain class of efficient coding models previously investigated theoretically by the authors.

      The results in experiments and modeling suggest that competing efficient coding models, optimizing mutual information alone, may be incomplete by missing the role of the task.

      We thank Reviewer #2 for her/his positive comments on our work.

      Weaknesses:

      The claims would be more strongly validated if data were present at more than two widths in the discrimination experiment.

      We agree that including additional prior widths would allow for a more detailed validation of the predicted scaling law, in particular in the discrimination task. Our design choices across the two experiments reflect a trade-off between the number of prior widths and the number of trials per condition. In the estimation task, we include three widths because this is necessary to identify all three parameters of the model: the variance of the motor noise (𝜎<sub>0</sub><sup>2</sup>), the baseline variance of internal imprecision (𝜈<sup>2</sup>), and the scaling exponent (𝛼). Extending both tasks to include additional prior widths would indeed provide a more robust test of the predicted scaling law. We now note this point in the revised Discussion (p. 17).

      A very strong prediction of the model -- which determines encoding entirely from prior and task -- is that Fisher Information is uniform throughout the range, strongly at odds with the traditional assumption of imprecision increasing with the numerosity (Weber/Fechner law). This prediction should be checked against the data collected. It may not be trivial to determine this in the Estimation experiment, but should be feasible in the Discrimination experiment in the Wide condition: Is there really no difference in discriminability at numbers close to 10 vs numbers close to 90? Figure 2 collapses over those, so it's not evident whether such a difference holds or not. I'd have loved to look into this in reviewing, but the authors have not yet made their data publicly available - I strongly encourage them to do so.

      Importantly, the inverse u-shaped pattern in Figure 1 is itself compatible with a Weber's-law-based encoding, as shown by simulation in Figure 5d in Hahn & Wei [1]. This suggests a potential competing variant account, in apparent qualitative agreement with the findings reported: the encoding is compatible with Fisher's law, and only a single scalar, the magnitude of sensory noise, is optimized for the task for the loss function (3). As this account would be substantially more in line with traditional accounts of numerosity perception, while still exhibiting task-dependence of encoding as proposed by the authors, it would be worth investigating if it can be ruled out based on the data gathered for this paper.

      References:

      [1] Hahn & Wei, A unifying theory explains seemingly contradictory biases in perceptual estimation, Nature Neuroscience 2024

      Indeed our efficient-coding model predicts that a uniform prior should result in a constant Fisher-information function, and we agree with Reviewer #2 that this is at odds with the common assumption that the imprecision increases with the magnitude. To investigate this possibility, we now consider, in the revised manuscript, a more general model of Gaussian encoding, in which the internal representation, 𝑟, is normally distributed around an increasing transformation of the number, 𝜇(𝑥), as

      𝑟|𝑥~𝑁(𝜇(𝑥), 𝜈<sup>2</sup>𝑤<sup>2 𝛼</sup>),

      where the encoding function, 𝜇(𝑥), can be either linear (𝜇(𝑥) = 𝑥) or logarithmic (𝜇(𝑥) = log (𝑥)). This allows us to test whether the data are better captured by a uniform Fisher information (as predicted by the linear encoding under a uniform prior) or by a compressed, Weber-like representation.
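      To make the contrast between the two encodings explicit, the Fisher information implied by this Gaussian model can be worked out directly. The derivation below is a sketch that uses only the standard result for a Gaussian observation whose mean depends on 𝑥 and whose variance does not; the symbol σ² is shorthand introduced here for the noise variance 𝜈²𝑤^(2𝛼).

```latex
% Gaussian encoding: r | x ~ N(mu(x), sigma^2), with sigma^2 = nu^2 w^{2 alpha}
\begin{align*}
  I(x) &= \frac{\mu'(x)^2}{\sigma^2}, \qquad \sigma^2 = \nu^2 w^{2\alpha} \\
  \text{linear encoding, } \mu(x) = x: \quad
  I(x) &= \frac{1}{\nu^2 w^{2\alpha}} \quad \text{(constant in } x\text{)} \\
  \text{logarithmic encoding, } \mu(x) = \log x: \quad
  I(x) &= \frac{1}{\nu^2 w^{2\alpha}\, x^2} \quad \text{(decreasing in } x\text{)}
\end{align*}
```

      In both cases the inverse square root of the Fisher information scales as 𝑤^𝛼, so the choice of encoding changes how precision varies with 𝑥 within a condition but not how it scales with the prior width across conditions.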

      We note, first, that in both tasks our conclusions regarding the dependence of the imprecision on the prior width remain unchanged, whether we choose the linear encoding or the logarithmic encoding. With both choices of encoding, the estimation task is best fit by a model with 𝛼 = 1/2, and the discrimination task by a model with 𝛼 = 3/4, implying a sublinear scaling of the variance with the width of the prior, in quantitative agreement with our theory.

      In the estimation task, the logarithmic encoding yields a significantly lower BIC than the linear one, by more than 380 (see Table 1). The results are less clear in the discrimination task, where the BIC with the logarithmic encoding is lower by 2.1 when pooling together the responses of all the subjects, but it is larger by 2.6 when fitting each subject individually. We conduct in addition a “Bayesian model selection” procedure, to estimate the relative prevalence of each encoding among subjects. The resulting estimate of the fraction of the population that is best fit by the logarithmic encoding is 87.6% in the estimation task, and 45.9% in the discrimination task (vs. 12.4% and 54.1% for the linear encoding).

      To further investigate the behavior of subjects in the Discrimination task, we look at their proportion of correct choices in the Wide and Narrow conditions, for the trials in which both averages are below the middle value of the prior, and for those in which both are above the middle value. We find no significant difference in the Narrow condition (see Figure below). In the Wide condition, the proportion of correct responses appears larger when the averages are small (with a significant difference when binning together the trials in which the absolute difference between the averages is between 4 and 12; Fisher's exact test p-value: 0.030).

      To complement this analysis, we fit a probit model with lapses, which is equivalent to our Gaussian model with linear encoding, but allowing the noise scale parameter to differ when both averages are above, or below, the middle value of the prior. We fit this model separately in each condition, only on the trials in which both averages are either above or below the middle value; and we test a more constrained model in which the scale parameter is equal for both small and large averages. In the Narrow condition, a likelihood-ratio test does not reject the null hypothesis that the scale parameter is constant (𝜒<sup>2</sup>(1) = 0.026, 𝑝 = 0.87), but in the Wide condition this hypothesis is rejected (𝜒<sup>2</sup> (1) = 7.6, 𝑝 = 0.006). In this condition the best-fitting scale parameter is 29% larger (9.4 vs. 6.3) with the large averages than with the small averages, pointing to a larger imprecision with the larger numbers.
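
      The p-values of the likelihood-ratio tests can be recovered from the reported statistics. The sketch below assumes the usual asymptotic result that the statistic follows a chi-squared distribution with one degree of freedom under the null hypothesis (one constrained parameter), in which case the tail probability is erfc(√(𝜒<sup>2</sup>/2)). The function names are illustrative and are not taken from the study's code.

```python
import math


def chi2_sf_1dof(stat: float) -> float:
    """Tail probability P(X > stat) for a chi-squared variable with 1 degree of freedom.

    For one degree of freedom this equals erfc(sqrt(stat / 2)).
    """
    if stat < 0:
        raise ValueError("the test statistic must be non-negative")
    return math.erfc(math.sqrt(stat / 2.0))


def likelihood_ratio_statistic(loglik_constrained: float, loglik_full: float) -> float:
    """Twice the log-likelihood gain of the full model over the constrained one."""
    return 2.0 * (loglik_full - loglik_constrained)


# Statistics reported in the text for the Narrow and Wide conditions.
p_narrow = chi2_sf_1dof(0.026)
p_wide = chi2_sf_1dof(7.6)
print(round(p_narrow, 2), round(p_wide, 3))
```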

      These results and the prevalence of the Weber/Fechner encoding prompt us to consider, in our efficient-coding model, the hypothesis that a logarithmic compression is an additional constraint on the possible encoding schemes. In our model, the internal representation (𝑟) could take any form as long as its Fisher information verified the constraint in Eq. 5 on the integral of its square-root. We now consider a strong, additional constraint: that over the support of the prior, the Fisher information of the signal must be of the form that one would obtain with a logarithmic encoding, i.e., 𝐼(𝑥) ∝ 1/𝑥<sup>2</sup>. (For the sake of generality we choose this specification instead of directly assuming a logarithmic encoding, because other types of encoding schemes yield a Fisher information of this form, e.g., one with “multiplicative noise” (Zhou et al., 2024); we do not seek, here, to distinguish between these different possibilities). We solve the same efficient-coding optimization problem (Eq. 6), but now with this additional constraint. We find that the resulting optimal Fisher information is approximately:

      , for the estimation task,

      and , for the discrimination task,

      for any 𝑥 on the support of the prior, and where 𝑥<sub>mid</sub> is the middle of the prior and 𝜃 is a constant. These Fisher-information functions differ from the one previously obtained without the additional constraint (Eq. 9), in that they fall off as 1/𝑥<sup>2</sup>, consistent with our additional constraint. However, we note that the dependence on the prior width, 𝑤, is identical: here also, the imprecision is proportional to 𝑤<sup>1/2</sup>, in the estimation task, and to 𝑤<sup>3/4</sup>, in the discrimination task.

      In its logarithmic variant (𝜇(𝑥) = log (𝑥)), the Fisher information of the model of Gaussian representations that we have considered throughout is 1/(𝑥 𝜈 𝑤<sup>𝛼</sup>)<sup>2</sup>. It is thus consistent with the predictions just presented, if 𝛼 = 1/2 for the estimation task, and 𝛼 = 3/4 for the discrimination task, i.e., the two values that best fit the data.

      This is precisely the model suggested by Reviewer #2. Overall, we conclude that with both linear and logarithmic encoding schemes, our efficient-coding model — wherein the degree of imprecision is endogenously determined — accounts for the task-dependent sublinear scaling of the imprecision that we observe in behavioral data. As for the imprecision across numbers, a sizable fraction of subjects, particularly in the estimation task, are best fit by the logarithmic encoding, consistent with previous reports that numbers are often represented on a compressed, approximately logarithmic scale. This encoding may itself reflect an efficient adaptation to a long-term environmental prior that is skewed, with smaller numbers occurring more frequently, leading to greater representational precision. This pattern is less clear in the discrimination task. It is possible that the rate at which the precision decreases across numbers itself depends on the task, such that not only the overall level of imprecision, but also its variation across numbers, may be modulated by the task's demands. In this study we have focused on the endogenous choice of the overall precision, but an avenue for future research would be to examine how this adaptation interacts with the detailed shape of the encoding across numbers.
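
      The Fisher information quoted above for the logarithmic variant, 1/(𝑥 𝜈 𝑤<sup>𝛼</sup>)<sup>2</sup>, follows from the general expression (𝜇′(𝑥)/𝜎)<sup>2</sup> for a Gaussian representation with mean 𝜇(𝑥) and a standard deviation 𝜎 that does not depend on 𝑥. The sketch below assumes 𝜎 = 𝜈 𝑤<sup>𝛼</sup>, which is our reading of that formula; the numerical values of 𝑥, 𝑤 and 𝜈 are hypothetical.

```python
import math


def fisher_information(x: float, w: float, nu: float, alpha: float,
                       encoding: str = "log") -> float:
    """Fisher information of r | x ~ N(mu(x), (nu * w**alpha)**2)."""
    sigma = nu * w ** alpha
    if encoding == "log":
        dmu = 1.0 / x   # derivative of mu(x) = log(x)
    elif encoding == "linear":
        dmu = 1.0       # derivative of mu(x) = x
    else:
        raise ValueError("encoding must be 'log' or 'linear'")
    return (dmu / sigma) ** 2


def imprecision(x: float, w: float, nu: float, alpha: float,
                encoding: str = "log") -> float:
    """Scale of the imprecision, 1 / sqrt(Fisher information)."""
    return 1.0 / math.sqrt(fisher_information(x, w, nu, alpha, encoding))


# Hypothetical values: quadrupling the prior width with alpha = 1/2
# doubles the imprecision at a given number x.
ratio = imprecision(50.0, 80.0, 0.1, 0.5) / imprecision(50.0, 20.0, 0.1, 0.5)
print(ratio)
```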

      In the revised manuscript, we have modified the presentation of the model to include the transformation 𝜇(𝑥) (p. 6-7 and 10-11). We have accordingly updated Table 1 (shown above; p. 24), which reports the BICs of all the models for the estimation task (and which now includes the models with logarithmic encoding). There is now a section in the Results dedicated to the question of the logarithmic compression, which includes the efficient-coding model constrained by the logarithmic encoding (p. 15-16). The results on the performance of subjects with larger numbers are presented in Methods (p. 29-31), and mentioned in the main text (p. 14-15). The Methods also provides details about the efficient-coding model with logarithmic encoding (p. 32-33). These results are further commented on in the Discussion (p. 18). Finally, the data and code are now available online at this address: https://osf.io/d6k3m/ , which we note on p. 33.

      Reference

      Zhou, J., Duong, L. R., & Simoncelli, E. P. (2024). A unified framework for perceived magnitude and discriminability of sensory stimuli. Proceedings of the National Academy of Sciences, 121(25), e2312293121. https://doi.org/10.1073/pnas.2312293121

      Reviewer #3 (Public review):

      Summary:

      This work demonstrates that people's imprecision in numeric perception varies with the stimulus context and task goal. By measuring imprecision across different widths of uniform prior distributions in estimation and discrimination tasks, the authors find that imprecision changes sublinearly with prior width, challenging previous range normalization models. They further show that these changes align with the efficient encoding model, where decision-makers balance expected rewards and encoding costs optimally.

      Strengths:

      The experimental design is straightforward, controlling the mean of the number distribution while varying the prior width. By assessing estimation errors and discrimination accuracy, the authors effectively highlight how imprecision adjusts across conditions.

      The model's predictions align well with the data, with the exponential terms (1/2 and 3/4) of imprecision changes matching the empirical results impressively.

      We thank Reviewer #3 for his/her positive comments on our work.

      Weaknesses:

      Some details in the model section are unclear. Specifically, I'm puzzled by the Wiener process assumption where r∣x∼N(m(x)T,s^2T). Does this imply that both the representation of number x and the noise are nearly zero at the beginning, increasing as observation time progresses? This seems counterintuitive, and a clearer explanation would be helpful.

      In the original formulation of the model, indeed both the mean of the representation and its variance are nearly zero when T is also near zero, but in such a way that the Fisher information, 𝑇(𝑚′(𝑥)/𝑠)<sup>2</sup>, is proportional to 𝑇. We note that a different specification, with a mean 𝑚(𝑥) (instead of 𝑚(𝑥)𝑇) and a variance 𝑠<sup>2</sup>/𝑇 (instead of 𝑠<sup>2</sup>𝑇), i.e., 𝑟|𝑥~𝑁(𝑚(𝑥), 𝑠<sup>2</sup>/𝑇), for 𝑇 > 0, would result in the same Fisher information.
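
      The equivalence of the two specifications can be checked numerically. The sketch below uses the standard result that a Gaussian representation whose variance does not depend on 𝑥 has a Fisher information equal to the squared derivative of its mean divided by its variance; the values of 𝑚′(𝑥), 𝑠 and 𝑇 are hypothetical.

```python
import math


def fisher_info_gaussian(dmean_dx: float, variance: float) -> float:
    """Fisher information about x carried by r ~ N(mean(x), variance),
    when the variance does not depend on x."""
    return dmean_dx ** 2 / variance


def wiener_spec(m_prime: float, s: float, T: float) -> float:
    """r | x ~ N(m(x) T, s^2 T): mean and variance both grow with T."""
    return fisher_info_gaussian(m_prime * T, s ** 2 * T)


def rescaled_spec(m_prime: float, s: float, T: float) -> float:
    """r | x ~ N(m(x), s^2 / T): fixed mean, variance shrinking with T."""
    return fisher_info_gaussian(m_prime, s ** 2 / T)


# Hypothetical values of m'(x), s and T.
m_prime, s, T = 0.8, 2.0, 3.5
print(wiener_spec(m_prime, s, T), rescaled_spec(m_prime, s, T), T * (m_prime / s) ** 2)
```

      Both specifications return 𝑇(𝑚′(𝑥)/𝑠)<sup>2</sup>, so they differ only in how the representation is scaled, not in how much information it carries about 𝑥.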

      In any event, in the revised manuscript, we now formulate the model differently. Specifically, we assume that the encoding results from an accumulation of independent, identically-distributed signals, but the precision of each signal is limited, and each of them entails a cost. Formally, we posit, first, that the Fisher information of one signal, 𝐼<sub>1</sub>(𝑥), is subject to the constraint:
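
      A constraint of this kind can be written as follows. This is a sketch consistent with the bound on the integral of the square root of the Fisher information described earlier in this response, and with the constant 𝐾 that enters 𝜃 = 𝜆/𝐾 further below; denoting the bound by 𝐾 is our assumption.

```latex
\int \sqrt{I_1(x)}\,\mathrm{d}x \;\le\; K
```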

      This constraint appears in many other efficient-coding models in the literature (Wei & Stocker 2015, 2016; Wang et al. 2016; Morais & Pillow, 2018; etc.), and it arises naturally for unidimensional encoding channels (Prat-Carrabin & Woodford, 2001; e.g., for a neuron with a sigmoidal tuning curve, it is equivalent to assuming that the range of possible firing rates is bounded). Second, we assume that the observer incurs a cost each time a signal is emitted (e.g., the energy resources consumed by action potentials). The total cost is thus proportional to the number of signals, which we denote by 𝑛. More signals, however, allow for a better precision: specifically, under the assumption of independent signals, the total Fisher information resulting from 𝑛 signals is the sum of the Fisher information of each signal, i.e., 𝐼(𝑥) = 𝑛𝐼<sub>1</sub>(𝑥).

      A tradeoff ensues between the increased precision brought by accumulating more signals, and the cost of these signals. We assume that the observer chooses the function 𝐼<sub>1</sub>(.) and the number 𝑛 of signals that solve the minimization problem subject to ,

      where 𝜆 > 0. We can first solve this problem for the Fisher information of one signal, 𝐼<sub>1</sub>(𝑥). In the case of a uniform prior of width 𝑤, we find that it is zero outside of the support of the prior, and

      for any 𝑥 on the support of the prior. This intermediate result corresponds to the optimal Fisher information of an observer who is not allowed to choose the number of signals, 𝑛 (and who instead receives 𝑛 = 1 signal). It is the solution predicted by the efficient-coding models mentioned above, which include the constraint on 𝐼<sub>1</sub>(𝑥) but do not allow the observer to choose the number of signals, 𝑛. With this solution, the scale of the observer's imprecision, 1/√𝐼<sub>1</sub>(𝑥), is proportional to 𝑤, and it does not depend on the task, contrary to our experimental results.

      Solving the optimization problem for 𝑛, in addition to 𝐼<sub>1</sub>(𝑥), we find that with a uniform prior the optimal number is proportional to 𝑤 in the estimation task, and to 𝑤<sup>1/2</sup> in the discrimination task (specifically, treating 𝑛 as continuous, we obtain ). In other words, the observer chooses to obtain more signals when the prior is wider, and in a way that depends on the task. We give the general solution for the total Fisher information, 𝐼(𝑥) = 𝑛𝐼<sub>1</sub>(𝑥), in the case of a prior 𝜋(𝑥) that is not necessarily uniform:

      where 𝜃 = 𝜆/𝐾. This is of course the same solution that we obtained in the original manuscript.
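
      The link between the number of signals and the scaling exponents can be illustrated numerically for a uniform prior. The sketch below assumes that the single-signal Fisher information is constant on the support of the prior and equal to (𝐾/𝑤)<sup>2</sup> (the value that saturates a bound 𝐾 on the integral of its square root), so that the imprecision with 𝑛 signals is 𝑤/(𝐾√𝑛). Proportionality constants are set to 1, and 𝑛 is taken proportional to 𝑤 for the estimation task and to 𝑤<sup>1/2</sup> for the discrimination task, the latter being the scaling implied by the exponent 3/4.

```python
import math


def imprecision(w: float, n: float, K: float = 1.0) -> float:
    """Scale of imprecision 1/sqrt(I), with I = n * (K / w)**2 on the prior's support."""
    return w / (K * math.sqrt(n))


def loglog_slope(f, w_lo: float = 10.0, w_hi: float = 1000.0) -> float:
    """Exponent of a power law f(w), estimated from two prior widths."""
    return (math.log(f(w_hi)) - math.log(f(w_lo))) / (math.log(w_hi) - math.log(w_lo))


# A fixed number of signals yields a linear scaling with the prior width.
fixed_n = loglog_slope(lambda w: imprecision(w, n=1.0))
# n proportional to w (estimation) and to sqrt(w) (discrimination).
estimation = loglog_slope(lambda w: imprecision(w, n=w))
discrimination = loglog_slope(lambda w: imprecision(w, n=math.sqrt(w)))
print(fixed_n, estimation, discrimination)
```

      The three exponents are 1, 1/2 and 3/4: only when the number of signals grows with the prior width does the scaling become sublinear.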

      We hope that this new formulation of the efficient-coding model will seem more intuitive to the reader (p. 12-13 in the revised manuscript).

      The authors explore range normalization models with Gaussian representation, but another common approach is the logarithmic representation (Barretto-García et al., 2023; Khaw et al., 2021). Could the logarithmic representation similarly lead to sublinearity in noise and distribution width?

      We agree with Reviewer #3 that a common approach when modeling the perception of numbers is to consider a logarithmic encoding. We have conducted several analyses that examine this proposal. These are presented in detail in our response to a comment of Reviewer #2, above (p. 11-14 of this document). We briefly summarize our findings here:

      (i) A model with a logarithmic encoding better fits a majority of subjects in the estimation task, but a bit less than half the subjects in the discrimination task.

      (ii) The examination of the performance of subjects in the discrimination task, however, suggests that in the Wide condition they discriminate small numbers slightly better than larger numbers.

      (iii) We consider a constrained version of our efficient-coding model, in which the Fisher information must be consistent with that of a logarithmic encoding (i.e., decreasing as 1/𝑥<sup>2</sup>); we find that the resulting optimal Fisher information depends on the prior width in the same way as without the constraint, i.e., a scaling of the imprecision with 𝑤<sup>1/2</sup>, in the estimation task, and with 𝑤<sup>3/4</sup>, in the discrimination task.

      (iv) When considering the model with logarithmic encoding, we find that it best fits the data when its imprecision scales with the width with the same exponents, i.e., 𝑤<sup>1/2</sup>, in the estimation task (𝛼 = 1/2), and 𝑤<sup>3/4</sup>, in the discrimination task (𝛼 = 3/4). In other words, the data support the predictions of our theoretical model.

      In the revised manuscript, we have accordingly modified the presentation of the model (p. 6-7 and 10-11) and Tables 1 (p. 24) and 2 (p. 30), which report the BICs. There is now a section in the Results dedicated to the question of the logarithmic compression, including the efficient-coding model constrained by the logarithmic encoding (p. 15-16). The results on the performance of subjects with larger numbers are presented in Methods (p. 29-31), and mentioned in the main text (p. 15-16). The Methods also provides details about the efficient-coding model with logarithmic encoding (p. 32-33). These results are further commented on in the Discussion (p. 18). Finally, we now cite the articles mentioned by Reviewer #3 (Barretto-García et al., 2023; Khaw et al., 2021).

      Additionally, Heng et al. (2020) found that subjects did not alter their encoding strategy across different task goals, which seems inconsistent with the fully adaptive representation proposed here. I didn't find the analysis of participants' temporal dynamics of adaptation. The behavioral results in the manuscript seem to imply that the subjects adopted different coding schemes in a very short period of time. Yet in previous studies of adaptation, experimental results seem to be more supportive of a partial adaptive behavior (Bujold et al., 2021; Heng et al., 2020), which might balance experimental and real-world prior distributions. Analyzing temporal dynamics might provide more insight. Noting that the authors informed subjects about the shape of the prior distribution before the experiment, do the results in this manuscript suggest a top-down rapid modulation of number representation?

      We thank Reviewer #3 for his/her comment and for pointing to these articles. The Reviewer raises several points — that of the dynamics of adaptation, that of the adaptation to the prior, and that of the adaptation to the task. We address each of them.

      To investigate the dynamics of the subjects’ adaptation, we examined separately, in each task, the responses obtained in the trials in the first and second halves of each condition. In the estimation task, the standard deviations of responses, as a function of the presented number and of the prior width, are very similar in the two halves (see Figure 8, panel a). The Bonferroni-Holm-corrected p-values of Levene's tests of equality of the variances across the two halves are all above 0.13, and thus we do not reject the hypothesis that the variance in the first half of the trials is equal to the variance in the second half. Moreover, the variance in both halves appears to be a linear function of the width, rather than the squared width (panel b). We conclude that the behavior of subjects in the estimation task is stable across each experimental condition, including the sublinear scaling of their imprecision.

      In the discrimination task, the subjects' choice probabilities, as a function of the difference between the averages of the red and blue numbers, are similar in the first and second halves of trials (panel c). The Bonferroni-Holm-corrected p-values of Fisher exact tests of equality of proportions (in bins of the average difference that contain about 500 trials each) are all above 0.9, and thus we do not reject the hypothesis that the choice probabilities are equal, in the first and second halves of the trials. Furthermore, the choice probabilities as a function of the absolute average difference normalized by the prior width raised to the exponent 3/4 are all similar, across session halves and across prior widths, suggesting that the sublinear scaling that we find is a stable behavior of subjects (panel d).
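
      For readers unfamiliar with the Bonferroni-Holm correction applied to the tests above, the following is a minimal sketch of the standard step-down procedure: the p-values are sorted, the 𝑘-th smallest is multiplied by the number of hypotheses not yet processed, and monotonicity is enforced. The raw p-values are hypothetical, and the code is an illustration rather than the analysis code of the study.

```python
import math


def holm_bonferroni(p_values):
    """Holm's step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The (rank + 1)-th smallest p-value is scaled by the number of
        # hypotheses remaining, then capped at 1.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Adjusted p-values must not decrease along the sorted order.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


raw = [0.01, 0.04, 0.03, 0.20]  # hypothetical raw p-values
adjusted = holm_bonferroni(raw)
print(adjusted)
```

      A hypothesis is rejected at level 0.05 when its adjusted p-value falls below 0.05; adjusted values that are all above 0.13 or 0.9, as reported here, lead to no rejection.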

      Overall, we conclude that the behavior we exhibit in both tasks is stable over the course of each experimental condition. We note that in both experiments, subjects were explicitly informed of the prior distribution at the beginning of each condition, and each condition included two preliminary training phases that familiarized them with the prior (the specifics for each task are detailed in the Methods section).

      As pointed out by Reviewer #3, Heng et al. (2020) and Bujold et al. (2021) report a partial adaptation of encoding to recently experienced distributions. We note that in our study, a sizable fraction of subjects, particularly in the estimation task, are best fit by the logarithmic encoding. This suggests that, while subjects adapt to the experimental prior, they retain a residual logarithmic compression — an encoding that itself would be efficient under a long-term, skewed prior in which smaller numbers are more frequent. In that sense our findings are thus consistent with the partial adaptation of Heng et al. (2020) and Bujold et al. (2021). At the same time, the same sublinear scaling of imprecision that we find in our study has been obtained in a numerosity-estimation task in which the prior was changed on every trial (Prat-Carrabin et al., 2025), indicating that the adaptation to the prior can occur quickly (on the order of a second) — possibly through a fast top-down modulation of the encoding, as suggested by Reviewer #3. These findings suggest that on a short timescale the encoding adapts efficiently to the prior (as evidenced by the scaling in imprecision), but within structural constraints (the logarithmic encoding).

      Regarding the adaptation to the task, Heng et al. (2020) indeed do not find subjects to be adapting their encoding, across two discrimination tasks (one in which the subject is rewarded for making the correct choice, and one in which the subject is rewarded with the chosen option). A difference with our paradigm is that their task involves simultaneous presentation of two dot arrays, while our discrimination task uses two interleaved sequences of Arabic numerals. More importantly, we do not directly compare the encoding between the estimation and discrimination tasks. Instead, we show that within each task, the adaptation to the prior is quantitatively consistent with the optimal coding predicted for that task's objective, as reflected in the task-specific sublinear scaling exponents. Directly contrasting the encoding across tasks would be a very interesting direction for future work.

      In the revised manuscript, we present the analysis on the stability of subjects’ behavior in the Methods section (p. 29), and we mention it in the main text when presenting the results of the estimation task (p. 5) and of the discrimination task (p. 8-10). In the Discussion, we cite Heng et al. (2020) and Bujold et al. (2021) and comment on the adaptation to the prior and to the task (p. 18).

      Barretto-García, M., De Hollander, G., Grueschow, M., Polanía, R., Woodford, M., & Ruff, C. C. (2023). Individual risk attitudes arise from noise in neurocognitive magnitude representations. Nature Human Behaviour, 7(9), 1551–1567. https://doi.org/10.1038/s41562-023-01643-4

      Bujold, P. M., Ferrari-Toniolo, S., & Schultz, W. (2021). Adaptation of utility functions to reward distribution in rhesus monkeys. Cognition, 214, 104764. https://doi.org/10.1016/j.cognition.2021.104764

      Heng, J. A., Woodford, M., & Polania, R. (2020). Efficient sampling and noisy decisions. eLife, 9, e54962. https://doi.org/10.7554/eLife.54962

      Khaw, M. W., Li, Z., & Woodford, M. (2021). Cognitive Imprecision and Small-Stakes Risk Aversion. The Review of Economic Studies, 88(4), 1979–2013. https://doi.org/10.1093/restud/rdaa044

      Recommendations for the authors:

      Reviewer #2 (Recommendations for the authors):

      (1) As mentioned above, the result of inverse u-shaped variability is in strong qualitative agreement with the predictions of a generic Bayesian encoding-decoding model of a flat prior, even under a standard encoding respecting Weber's law, as shown in Figure 5d in: Hahn & Wei, A unifying theory explains seemingly contradictory biases in perceptual estimation, Nature Neuroscience 2024. This paper should probably be cited.

      We now cite Hahn & Wei, 2024. We comment above on our analyses regarding the logarithmic encoding.

      (2) "Requests for the data can be sent via email to the corresponding author" Why are the data not made openly available? Barring ethical or legal concerns (which are not apparent for this type of data), there is no reason not to make data and code open.

      "Requests for the code used for all analyses can be sent via email to the corresponding author." Same: why not make them open?

      We agree that it is good practice to make the data and code publicly available. They are now available here: https://osf.io/d6k3m/

      Reviewer #3 (Recommendations for the authors):

      The orange dot in Figure 1C does not appear to be described in the figure caption, although an explanation of it is mentioned in the main text.

      We thank Reviewer #3 for pointing out this omission. We now include explanations in the caption.

      I hope the authors will consider making their data publicly available on OSF or another platform.

      The data and code are now publicly available on OSF: https://osf.io/d6k3m/

    1. eLife Assessment

      This study provides important insights into how Trypanosoma cruzi populations diversify surface protein expression, showing through single-cell RNA sequencing that trans-sialidase-like genes are expressed heterogeneously across individual parasites, a pattern with clear implications for immune evasion. The evidence is convincing, supported by robust single-cell transcriptomic analyses, consistent quantitative measures of expression heterogeneity, and integration with genomic organization that together argue against purely stochastic expression.

    2. Reviewer #1 (Public review):

      Summary:

      The authors aimed to assess the variability in expression of surface protein multigene families between amastigote and trypomastigote Trypanosoma cruzi, as well as between individuals within each population. The analysis presented shows higher expression of multigene family transcripts in trypomastigotes compared to amastigotes and that there is variation in which copies are expressed between individual parasites. Notably, they find no clear subpopulations expressing previously characterised trans-sialidase groups and that no patterns of coexpressed TcS genes were evident within individual cells or subpopulations. They also note that TcS encoded in the core genome are more often expressed, compared to TcS genes encoded in other genome compartments.

      Strengths:

      Additionally, the authors successfully process methanol fixed parasites with the 10x Genomics platform. This approach is valuable for other studies where using live parasites for these methods is logistically challenging.

      In this second submission the authors show the kallisto mapping approach used is as robust as possible, and that this approach outperforms STAR mapping.

      Weaknesses:

      The authors describe a single experiment, which lacks repeats, controls or complementation with other approaches and the investigation is limited to the trans-sialidase transcripts.

      Comments on revised version:

      Thank you to the authors for taking the time to thoroughly address the peer review. The main concerns have now been addressed, and the manuscript edited to make points of confusion clearer.

    3. Reviewer #2 (Public review):

      Summary:

      This manuscript presents a valuable single-cell RNA-seq study on Trypanosoma cruzi, an important human parasite. It investigates the expression heterogeneity of surface proteins, particularly those from the trans-sialidase-like (TcS) superfamily, within amastigote and trypomastigote populations. The findings suggest a previously underappreciated level of diversity in TcS expression, which could have implications for understanding parasite-host interactions and immune evasion strategies. The use of single cell approaches to delve into population heterogeneity is strong. However, the study does have some limitations that need to be addressed.

      The focus on single-cell transcriptional heterogeneity in surface proteins, especially the TcS family, in T. cruzi is novel. Given the important role of these proteins in parasite biology and host interaction, the findings have potential significance.

      Strengths:

      The key finding of heterogeneous TcS expression in trypomastigotes is well-supported. The analysis comparing multigene families, single-copy genes, and ribosomal proteins highlights the unusual nature of the variation in surface protein coding genes.

      Weaknesses:

      While the manuscript identifies TcS heterogeneity, the functional implications of the different expression profiles remain speculative. The authors state it may reflect differences in infectivity, but no direct experimental evidence supports this.

      The manuscript lacks any functional validation of the single-cell findings. For instance, do the trypomastigote subpopulations identified based on TcS expression exhibit differences in infectivity, host cell tropism, or immune evasion? Such experiments would greatly strengthen the study.

      The authors identify a subpopulation of TcS genes that are highly expressed in many cells. However, it is unclear if these correspond to previously characterized TcS members with specific functions.

      The authors hypothesize that observed heterogeneity may relate to chromatin regulation. However, the study does not directly address these mechanisms. There are interesting connections to be made with what they identify as colocalization of genes within chromatin folding domains, but the authors do not fully explore this. It would be insightful to address these mechanisms in future work. [...]

      Comments on revisions:

      The novel version of the manuscript has improved and satisfied this reviewer.

    4. Reviewer #3 (Public review):

      The study aimed to address a fundamental question in T. cruzi and Chagas disease biology - how much variation is there in gene expression between individual parasites? This is particularly important with respect to the surface protein-encoding genes, which are mainly from massive repetitive gene families with 100s to 1000s of variant sequences in the genome. There is very little direct evidence for how expression of these genes is controlled. The authors conducted a single cell RNAseq experiment of in vitro cultured parasites with a mixture of amastigotes and trypomastigotes. Most of the analysis focused on the heterogeneity of gene expression patterns amongst trypomastigotes. They show that heterogeneity was very high for all gene classes, but surface-protein encoding genes were the most variable. Interestingly, in the case of the trans-sialidase genes, many sequence variants were detected in fewer than 5% of parasites while a subset of 31 others was detected in >40% of parasites, hinting at compartmentalised expression control within the gene family. The biology of the parasite (e.g. extensive post-transcriptional regulation) and potential technical caveats (e.g. high dropout rates across the genome) make it difficult to infer connections to actual protein expression on the parasite surface, but the results are a significant advance for the field.

      (1) Limit of detection and gene dropouts.

      An average of ~1100 genes are detected per parasite which indicates a dropout rate of over 90%. It appears that RNA for the "average" single copy 'core' gene is only detected in around 3% of the parasites sampled (Figure 2c: ~100 / 3192). While comparable with some other trypanosome scRNAseq studies, this remains a caveat to the interpretation that high cell-to-cell variability in gene expression is explained by biological factors. The argument would be more convincing if the dropout rates and expression heterogeneity were minimal for highly expressed housekeeping genes. The authors are appropriately cautious in their interpretation and acknowledge the need for further validation.

      (2) Heterogeneity across the board.

      The authors focus on the relative heterogeneity in RNA abundance for surface proteins from the multicopy gene families vs core genes. While multicopy gene sequences do show significantly more cell-to-cell variability, there is still surprisingly high inequality of expression amongst genes in other classes including single copy housekeeping and ribosomal genes. Again the biological relevance of the comparison is uncertain and the authors acknowledge the need for further investigation.

      This study provides some tantalising evidence that the expression of surface genes may vary substantially between individual parasites in a single clonal population. The study is also amongst the very first to apply scRNAseq to T. cruzi, so the broader data set will be an important resource for researchers in the field.

      Comment on revised version:

      The manuscript is significantly improved. The revised explanations and figures make several aspects of the data analysis and interpretation much clearer to me now. Thanks to the authors.

    5. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      The authors aimed to assess the variability in the expression of surface protein multigene families between amastigote and trypomastigote Trypanosoma cruzi, as well as between individuals within each population. The analysis presented shows higher expression of multigene family transcripts in trypomastigotes compared to amastigotes and that there is variation in which copies are expressed between individual parasites. Notably, they find no clear subpopulations expressing previously characterised trans-sialidase groups. The mapping accuracy to these multicopy genes requires demonstration to confirm this, and the analysis could be extended further to probe the features of the top expressed genes and the other multigene families also identified as variable.

      Strengths:

      The authors successfully process methanol-fixed parasites with the 10x Genomics platform. This approach is valuable for other studies where using live parasites for these methods is logistically challenging.

      Weaknesses:

      The authors describe a single experiment, which lacks controls or complementation with other approaches and the investigation is limited to the trans-sialidase transcripts.

      It would be more convincing to show either bioinformatically or by carrying out a controlled experiment, that the sequencing generated has been mapped accurately to different members of multigene families to distinguish their expression. If mapping to the multigene families is inaccurate, this will impact the transcript counts and downstream analysis.

      We thank the reviewer for raising these important points.

      We agree that the analysis of multigene families at the single-cell level is an important question, particularly given the heterogeneity observed across several of them. However, the aim of this short report is not to provide a comprehensive analysis of the entire experiment, but rather to focus on what we consider an important biological phenomenon observed in TcTS genes.

      Regarding the mapping accuracy of the reads, we acknowledge that this can limit the disambiguation of highly similar multicopy transcripts. This is, in fact, a common challenge when analyzing transcriptomic data from T. cruzi.

      To address this issue, we analyzed the sequence identity of the 3′ ends of TcS transcripts (defined as the 3′UTR plus 20% of the CDS region). As shown in Author response image 1, these regions display a median sequence identity of approximately 25%, indicating that sufficient sequence divergence exists for mapping algorithms to use during read assignment.

      In addition, it is important to note that kallisto, the software used in our analysis, was specifically designed to address multimapping reads through pseudoalignment combined with an expectation-maximization algorithm that probabilistically assigns reads across compatible transcripts.
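
      As an illustration only, the following minimal sketch shows how an expectation-maximization loop can divide ambiguous reads among compatible transcripts. It is a toy model and not kallisto's implementation: the function name `em_abundances`, the equivalence classes, and the read counts are hypothetical, and effective transcript lengths are ignored.

```python
def em_abundances(eq_classes, n_transcripts, n_iter=200):
    """Toy EM. eq_classes maps a tuple of compatible transcript ids to a read count."""
    # Start from uniform relative abundances.
    theta = [1.0 / n_transcripts] * n_transcripts
    total = sum(eq_classes.values())
    for _ in range(n_iter):
        expected = [0.0] * n_transcripts
        # E-step: split each class's reads in proportion to current abundances.
        for transcripts, count in eq_classes.items():
            denom = sum(theta[t] for t in transcripts)
            for t in transcripts:
                expected[t] += count * theta[t] / denom
        # M-step: renormalize expected counts into relative abundances.
        theta = [e / total for e in expected]
    # Return estimated read counts per transcript.
    return [t * total for t in theta]


# Hypothetical input: transcripts 0 and 1 are near-identical and share 60 reads;
# the uniquely mapping reads (30 vs 10) determine how the shared reads are split.
classes = {(0,): 30, (1,): 10, (0, 1): 60}
counts = em_abundances(classes, n_transcripts=2)
```

      In this toy case the shared reads are apportioned 3:1, following the ratio of uniquely assigned reads, so the estimates converge to 75 and 25 reads.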

      To directly assess performance, we simulated reads from the T. cruzi transcriptome used in this study (3′UTRs plus 20% of the CDS regions) and compared two mapping/counting strategies: (a) transcriptome pseudoalignment using kallisto, and (b) genome alignment followed by counting using STAR + featureCounts. The latter approximates the strategy implemented in CellRanger, the standard pipeline for quantifying expression levels from 10X Genomics single cell RNA-seq data. We found that kallisto recovered the simulated “true” counts with substantially higher accuracy than STAR + featureCounts (Pearson correlation: all genes, 0.991 vs 0.595; surface protein genes, 0.9996 vs 0.827; trans-sialidase (TcS) genes, 0.9998 vs 0.773). These results indicate that pseudoalignment is currently the optimal strategy for recovering the relative expression of highly similar gene family members (Author response image 1 C).

      Author response image 1

      (A) Distribution of pairwise sequence identity values calculated among the 3′-end regions of all transcripts (defined as the 3′UTR plus 20% of the coding sequence). (B) Distribution of read mapping coordinates over all multigene family transcripts, normalized as a percentage of the gene length. (C) Scatter plots showing the correlation between estimated transcript counts obtained using kallisto (red) and STAR + featureCounts (grey) versus the corresponding simulated ground-truth values.

      Reviewer #2 (Public review):

      Summary:

      This manuscript presents a valuable single-cell RNA-seq study on Trypanosoma cruzi, an important human parasite. It investigates the expression heterogeneity of surface proteins, particularly those from the trans-sialidase-like (TcS) superfamily, within amastigote and trypomastigote populations. The findings suggest a previously underappreciated level of diversity in TcS expression, which could have implications for understanding parasite-host interactions and immune evasion strategies. The use of single-cell approaches to delve into population heterogeneity is strong. However, the study does have some limitations that need to be addressed.

      The focus on single-cell transcriptional heterogeneity in surface proteins, especially the TcS family, in T. cruzi is novel. Given the important role of these proteins in parasite biology and host interaction, the findings have potential significance.

      Strengths:

      The key finding of heterogeneous TcS expression in trypomastigotes is well-supported. The analysis comparing multigene families, single-copy genes, and ribosomal proteins highlights the unusual nature of the variation in surface protein-coding genes.

      Weaknesses:

      While the manuscript identifies TcS heterogeneity, the functional implications of the different expression profiles remain speculative. The authors state it may reflect differences in infectivity, but no direct experimental evidence supports this.

      The manuscript lacks any functional validation of the single-cell findings. For instance, do the trypomastigote subpopulations identified based on TcS expression exhibit differences in infectivity, host cell tropism, or immune evasion? Such experiments would greatly strengthen the study.

      We thank the reviewer for their careful reading of the manuscript. We agree that obtaining experimental evidence on the influence of multiple multigene families would represent a significant advancement in the field. However, we would like to emphasize that this study is presented as a short communication centered on a specific and biologically relevant observation within a single multigene family. The aim of the manuscript is to highlight what we consider an important biological phenomenon that raises hypotheses to be tested in future work.

      The influence of phenotypic heterogeneity and its possible advantages under environmental pressures has been previously proposed for Trypanosoma cruzi, related trypanosomatids, and other biological systems, ranging from bacteria to tumors (for comprehensive reviews on this topic, see Seco-Hidalgo 2015, doi: 10.1098/rsob.150190, and Luzak 2021, doi: 10.1146/annurev-micro-040821-012953). While the reviewer is correct in noting that our model does not demonstrate a functional role for TcTS heterogeneity, the experimental approaches required to address this question in a large multigene family are highly complex. This is particularly challenging in T. cruzi, where the study of multigene families is limited by the restricted set of available molecular biology tools (such as RNAi). Therefore, further experimental validation of these observations falls outside the scope of this short report.

      In this revised version, we have included additional validation and clarification of the results, as well as a more explicit discussion of their limitations. In addition, we present a preliminary analysis exploring potential mechanisms that could coordinate the observed expression patterns of the TcTS family.

      The authors identify a subpopulation of TcS genes that are highly expressed in many cells. However, it is unclear if these correspond to previously characterized TcS members with specific functions.

      The TcS subgroup with a high frequency of detection comprises 31 genes, none of which belong to the catalytically active Group I trans-sialidases. Instead, this subgroup includes members of Groups II, III, IV, V, VI, and VIII. This information has been added to Supplementary Table 3 and is now stated in the revised manuscript.

      The authors hypothesize that observed heterogeneity may relate to chromatin regulation. However, the study does not directly address these mechanisms. There are interesting connections to be made with what they identify as the colocalization of genes within chromatin folding domains, but the authors do not fully explore this. It would be insightful to address these mechanisms in future work.

      In response to the reviewer’s and editorial team’s request for additional mechanistic insight into the regulatory processes that may be involved in the observed patterns, we have expanded the revised manuscript to discuss how the genomic context of TcS loci could contribute to the observed heterogeneity in TcS expression. As noted in the original version of the manuscript, TcS genes and other surface-protein gene families are largely partitioned into discrete genomic compartments, whose expression has been reported to be regulated by epigenetic control of chromatin-folding domains (doi.org/10.1038/s41564-023-01483-y). However, we previously showed that TcS genes detected in a high proportion of cells are, in most cases, dispersed throughout the genome, arguing against a model in which their preferential expression results from colocalization within a small number of ubiquitously activated chromatin domains. In response to the reviewer’s suggestion, we performed a more detailed analysis of the genomic locations of these TcS genes. We found that many of them are localized within the core compartment (new Figure 5). Because the core compartment is enriched for conserved, housekeeping genes that typically display more constitutive expression (doi.org/10.1038/s41564-023-01483-y), whereas the disruptive compartment is enriched for lineage-specific multigene families associated with variable, stage-specific, and recently reported stochastic expression (doi.org/10.1038/s41467-025-64900-2), our results are consistent with a model in which compartment-specific regulatory mechanisms (in addition to post-transcriptional regulation) influence the differential cellular expression of core- versus disruptive-located TcS genes. We have incorporated these results and discussion in the revised manuscript.

      The merging of technical replicates needs further justification and explanation as they were not processed through separate experimental conditions. While barcodes were retained, it would be informative to know how well each technical replicate corresponds with the other. If both datasets were sequenced on the same lane, the inclusion of technical replicates adds noise to the analysis.

      Regarding technical details, we now include the total number of mapped reads and the average number of reads mapped per cell (new paragraph in the Methods section).

      The technical replicates consist of a single Illumina library that was sequenced in two separate runs. As this approach is expected to be highly reproducible, we merged both runs into a single count table. To support this decision, we assessed the concordance between the two sequencing runs and observed an almost perfect correlation between them (Author response image 2).
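
      A minimal sketch of such a concordance check is given below, assuming per-cell read totals keyed by cell barcode for each run. The barcodes and counts are invented for illustration and are not the study's data; the Pearson coefficient is used here as one reasonable measure of agreement.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Hypothetical reads assigned per cell barcode in each sequencing run.
run1 = {"AAAC": 1200, "AAAG": 450, "AACT": 3100, "AAGT": 800}
run2 = {"AAAC": 1180, "AAAG": 470, "AACT": 3050, "AAGT": 820}

# Compare only barcodes observed in both runs, in a fixed order.
shared = sorted(set(run1) & set(run2))
r = pearson([run1[b] for b in shared], [run2[b] for b in shared])

# Merging sums the counts per barcode across the two runs.
merged = {b: run1.get(b, 0) + run2.get(b, 0) for b in set(run1) | set(run2)}
```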

      Author response image 2.

      Correlation analysis of number of reads assigned to cells between technical replicate 1 and technical replicate 2.

      While the number of cells sequenced (3192) seems reasonable, it's not clear how much the conclusions are affected by the depth of sequencing. A more detailed description of the sequencing depth and its impact on gene detection would be valuable.

      We detected a mean of 1088 genes per cell. Based on the 15,319 annotated protein-coding genes in the reference genome, this represents 7.1% of the T. cruzi protein-coding gene complement detected in each cell.

      Across the entire dataset, a total of 14,321 genes were detected in at least one cell, representing 93.5% of all annotated protein-coding genes. This suggests that our experiment captured a broad representation of the parasite's transcriptome.
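
      The two percentages above follow directly from the reported counts, as this short check shows (the variable names are ours):

```python
annotated_genes = 15319      # annotated protein-coding genes in the reference
mean_genes_per_cell = 1088   # mean number of genes detected per cell
genes_detected_any = 14321   # genes detected in at least one cell

# Share of the protein-coding complement detected per cell and across the dataset.
per_cell_pct = round(100 * mean_genes_per_cell / annotated_genes, 1)
dataset_pct = round(100 * genes_detected_any / annotated_genes, 1)
```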

      This per-cell detection rate is characteristic of droplet-based scRNA-seq and is consistent with other trypanosomatid studies. For example, the T. brucei single-cell atlas (Hutchinson et al., 2021) reported a median detection of 1052 genes per cell. In the case of T. cruzi, the recently published pre-print of the T. cruzi single cell atlas from Laidlaw & García-Sánchez et al. reported a mean between 298 and 928 genes detected per cell (depending on the sample).

      This information is now included in Methods.

      While most of the methods are clear, the way in which the subsampled gene lists were generated could be more thoroughly described, as some details are not clear for the subsampling of single-copy genes.

      The subsampling method was originally described in the Figure 2 legend; to better highlight this approach, we have now moved its description to the Methods section.

      Some of the figures are difficult to interpret. For example, the color scaling in the heatmap of Supplementary Figure 3B is not self-explanatory and it is hard to extract meaningful conclusions from the graph.

      We agree with the reviewer in this assessment. We have now modified the figures to be more self-explanatory and better reflect the conclusions.

      Reviewer #3 (Public review):

      The study aimed to address a fundamental question in T. cruzi and Chagas disease biology - how much variation is there in gene expression between individual parasites? This is particularly important with respect to the surface protein-encoding genes, which are mainly from massive repetitive gene families with 100s to 1000s of variant sequences in the genome. There is very little direct evidence for how the expression of these genes is controlled. The authors conducted a single-cell RNAseq experiment of in vitro cultured parasites with a mixture of amastigotes and trypomastigotes. Most of the analysis focused on the heterogeneity of gene expression patterns amongst trypomastigotes. They show that heterogeneity was very high for all gene classes, but surface-protein encoding genes were the most variable. In the case of the trans-sialidase gene family, many sequence variants were only detected in a small minority of parasites. The biology of the parasite (e.g. extensive post-transcriptional regulation) and potential technical caveats (e.g. high dropout rates across the genome) make it difficult to infer what this might mean for actual protein expression on the parasite surface.

      We thank the reviewer for this important comment, highlighting a central challenge when studying trypanosomatid biology. We acknowledge that in most eukaryotes and particularly in T. cruzi, where there is a predominant role of post-transcriptional regulation, mRNA levels are not always directly correlated with protein abundance, as previously reported by us and others (10.1186/s12864-015-1563-8, 10.1128/msphere.00366-21, 10.1590/S0074-02762011000300002, 10.1042/bse0510031). Nevertheless, steady-state transcript levels obtained by RNA-seq remain informative for assessing differential gene expression, and this approach has been widely used as a proxy for the study of gene expression profiles in T. cruzi (10.7717/peerj.3017, 10.1371/journal.ppat.1005511, 10.1016/j.jbc.2023.104623, 10.3389/fcimb.2023.1138456, 10.1186/s13071-023-05775-4).

      It's also interesting to note that recent proteomic analyses (10.1038/s41467-025-64900-2) have revealed substantial heterogeneity in the expression of surface proteins, including trans-sialidases, supporting the idea that the transcriptional heterogeneity we observe reflects a genuine biological feature that propagates to the protein level.

      We have now added a sentence to the discussion acknowledging this limitation and discussed the results from Cruz-Saavedra et al. in the revised manuscript.

      (1) Limit of detection and gene dropouts

      An average of ~1100 genes are detected per parasite which indicates a dropout rate of over 90%. It appears that RNA for the "average" single copy 'core' gene is only detected in around 3% of the parasites sampled (Figure 2c: ~100 / 3192). This may be comparable with some other trypanosome scRNAseq studies, but this still seems to be a major caveat to the interpretation that high cell-to-cell variability in gene expression is explained by biological rather than technical factors. The argument would be more convincing if the dropout rates and expression heterogeneity were minimal for well-known highly expressed genes e.g. tubulin, GAPDH, and ribosomal RNAs. Admittedly, in their Final Remarks, the authors are very cautious in their interpretation, but it would be good to see a more thorough discussion of technical factors that might explain the low detection rates and how these could be tested or overcome in future work.

      (2) Heterogeneity across the board

      The authors focus on the relative heterogeneity in RNA abundance for surface proteins from the multicopy gene families vs core genes. While multicopy gene sequences do show more cell-to-cell variability, the differences (Figure 2D) are roughly average Gini values of 0.99 vs 0.97 (single copy) or 0.95 (ribosomal). Other studies that have applied similar approaches in other systems describe Gini values of < 0.2-0.25 for evenly expressed "housekeeping" genes (PMIDs 29428416, 31784565). Values observed here of >0.9 indicate that the distribution for all gene classes is extremely skewed and so the biological relevance of the comparison is uncertain.
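
      For reference, the sketch below computes a Gini coefficient using the standard sorted-rank formula. The exact implementation used in the manuscript is not stated here, so this should be read as an assumption, and the per-cell counts are hypothetical.

```python
def gini(values):
    """Gini coefficient of non-negative per-cell counts (0 = even, near 1 = concentrated)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum over the sorted values (ranks start at 1).
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return (2 * weighted) / (n * total) - (n + 1) / n


even = gini([5] * 100)               # same count in every cell
one_cell = gini([0] * 99 + [7])      # gene detected in 1 of 100 cells
three_cells = gini([0] * 97 + [1] * 3)  # gene detected in 3 of 100 cells
```

      Under this definition, a gene detected with equal counts in k of n cells has a Gini of 1 - k/n, so detection in about 3% of cells is by itself enough to give a value near 0.97, whichever gene class it belongs to.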

      We recognize the limitations imposed by gene dropout in our data, as highlighted by the reviewer. Unfortunately, gene dropout is an inherent limitation of 10x Genomics data. Trypanosomatids are no exception in this regard, and the general metrics of the single-cell RNA-seq data in other reports are equivalent to those obtained in our experiment.

      Despite this important limitation, we believe that our comparative analyses (the contrast between TcS and ribosomal protein expression) provide valuable insights into a biological phenomenon with potential functional relevance for the parasite. Furthermore, we are actively working on generating single-cell RNA-seq data using alternative methodologies that improve gene dropout rates. We anticipate that these future studies will help clarify the extent of the phenomenon described in this work.

      Our results reveal a small subset of TcS genes that are frequently detected across cells, a pattern that is not compatible with random detection unless these genes were highly expressed and preferentially captured by random sampling. However, as shown in Figure 4b, many genes expressed at comparable levels are not detected at high frequencies. In line with this, Figure 4c shows that within individual cells, the detected TcS genes exhibit similar expression levels. Finally, we confirmed that this frequently detected subset shows high read counts at the bulk RNA-seq level (Figure 4 - Figure Supplement 1), consistent with the fact that these TcS are frequent in the population even when they are not especially highly expressed within each cell. Taken together, these findings argue against a purely random sampling of TcS genes and support the interpretation that this pattern reflects an underlying biological feature. We agree that further validation will be required. Accordingly, since the initial submission, we have been careful to frame our conclusions conservatively, explicitly noting that dropout remains a limitation of these data that could influence the observed patterns. In the revised version, we have strengthened this point by including a specific statement in the final remarks. Our interpretation is presented as a working hypothesis that is fully compatible with the observations reported here and may be informative for the field. To better reflect this reasoning, we have revised Figure 4b, expanded the discussion, and explicitly included this limitation in the final remarks of the revised manuscript.

      Nevertheless, this study does provide some tantalising evidence that the expression of surface genes may vary substantially between individual parasites in a single clonal population. The study is also amongst the very first to apply scRNAseq to T. cruzi, so the broader data set will be an important resource for researchers in the field.

      We thank the reviewer for highlighting the relevance of our study and for their positive assessment of the potential significance of these observations. We also agree that the dataset generated here may represent a useful resource for the community.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      (1) In Figures 1c and 1d, it would be useful to include the genes as the plot titles.

      We agree with the reviewer that including gene names in the plot makes the panels more self-explanatory. We have added gene names to the updated version of Figure 1.

      (2) Can you include the read lengths of the sequencing and whether this is sufficient to map accurately to very similar genes of the same multigene family? As stated in the public summary, this would make the data far more convincing, as standard 10x Chromium cannot distinguish similar gene copies unless a longer read 2 is used. Given that only the 3' end is targeted, is this enough to distinguish the TcS and other multigene family transcripts?

      We thank the reviewer for raising this important point. We agree that short 3′ biased reads can limit the disambiguation of highly similar multicopy transcripts. This is, in fact, a common challenge when analyzing transcriptomic data from T. cruzi.

      To address this issue, we analyzed the sequence identity of the 3′ ends of TcS transcripts (defined as the 3′UTR plus 20% of the CDS region). As shown in Author response image 1, these regions display a median sequence identity of approximately 25%, indicating that sufficient sequence divergence exists for mapping algorithms to use during read assignment.

      In addition, it is important to note that kallisto, the software used in our analysis, was specifically designed to address multimapping reads through pseudoalignment combined with an expectation-maximization algorithm that probabilistically assigns reads across compatible transcripts.

      To directly assess performance, we simulated reads from the T. cruzi transcriptome used in this study (3′UTRs plus 20% of the CDS regions) and compared two mapping/counting strategies: (a) transcriptome pseudoalignment using kallisto, and (b) genome alignment followed by counting using STAR + featureCounts. The latter approximates the strategy implemented in CellRanger, the standard pipeline for quantifying expression levels from 10X Genomics single cell RNA-seq data. We found that kallisto recovered the simulated “true” counts with substantially higher accuracy than STAR + featureCounts (Pearson correlation: all genes, 0.991 vs 0.595; surface protein genes, 0.9996 vs 0.827; trans-sialidase (TcS) genes, 0.9998 vs 0.773). These results indicate that pseudoalignment is currently the optimal strategy for recovering the relative expression of highly similar gene family members (Author response image 1C).

      The length of the R2 read (91bp) was included in Methods (line 411).

      (3) It is stated that 'single copy' genes also include 'low copy number genes'. What does this include exactly? Is it more accurate to say non-surface protein genes?

      The distinction we aim to make is between multigene families and the rest of the genome. Most multigene families encode surface proteins, but not all surface protein genes belong to multigene families. To clarify this point we included a sentence in methods to reflect that when we describe “surface proteins” we are referring to surface proteins coded by multigene families (line 453). In addition, long-read genomic DNA sequencing and assembly have revealed that many genes previously believed to be single-copy are actually duplicated at low copy numbers (doi.org/10.1099/mgen.0.000177). For this reason, we extend the concept of “single-copy” genes to include those that have only a few duplicates.

      (4) It is stated in line 127 that TcS have particularly high heterogeneity - it does not look that way by eye compared to the other multigene families. Can statistics be used to prove this, or simply state that the decision was made to focus on the TcS?

      As noted by the reviewer, all multigene families show significantly higher heterogeneity compared to single-copy genes, as stated in the text and shown in the legends of Figure 2 and Supplementary Figure 1 and in the new Supplementary Table 2.

      That said, it was not the statistical results that guided our decision to focus on TcS, but rather their well-established biological relevance in T. cruzi. As suggested, we have now emphasized this rationale more clearly in the revised text (lines 160-167).

      Besides, recent work has shown that TcS genes exhibit a bimodal distribution of expression levels using bulk RNA-seq data, in contrast to core genes and other multigene families (doi.org/10.1038/s41467-025-64900-2, doi.org/10.1038/s41564-023-01483-y). This distinct regulatory behavior further justifies our decision to examine TcS separately.

      (5) Expression of different TcS has been investigated between the different life cycle stages for a few individual genes previously (Freitas et al.). Can the authors not extend this investigation to all the genes detected by scRNA-seq here to demonstrate those with higher/lower expression in amastigotes vs trypomastigotes, building on Figure 2A? Are particular groups linked to either stage?

      We performed this analysis and did not observe any correlation between TcS groups and life cycle stage. In all cases TcS were more frequently detected in trypomastigotes. This difference was statistically significant for all groups except group VII, likely due to the low number of genes analyzed in this group (Author response image 3).

      Author response image 3.

      Per-gene number of expressing cells by TcS group and life-stage. Boxplots show, for each TcS group (I–VIII), the distribution across genes of the number of cells in which the gene is detected. Each point represents a single TcS; Amastigote cells: green points/boxes, Trypomastigote cells: salmon points/boxes. The y-axis is on log10 scale. Asterisks indicate statistically significant differences from the comparison between Amastigote and Trypomastigote within each TcS group, assessed using a paired two-sided Wilcoxon signed-rank test: * p < 0.05, ** p < 0.01, *** p < 0.001.

      (6) What exactly is the Z-score shown in Figure 2B?

      In this analysis, num_multigene represents the number of multigene family genes detected in each individual cell. For every cell, we counted how many genes from our predefined multigene family gene list have detectable expression (more than zero UMI counts); in the UMAP plot, this value is reflected by the size of each point. In contrast, z_multigene captures the relative expression level of multigene family genes within each cell. This metric is calculated by summing the UMI counts of all multigene family genes per cell and then standardizing this value across the dataset using a z-score transformation, such that positive values reflect above-average multigene family expression and negative values reflect below-average levels. In the UMAP plot, this metric determines the color scale of each point. Taken together, num_multigene and z_multigene allow us to distinguish cells that express multigene family genes broadly (high gene counts), strongly (high relative expression), both, or neither, and to relate these patterns to identified cell populations.

      We included a short description in the legend of the new version of Figure 2 (lines 176-180).
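
      The two per-cell metrics described in this response can be sketched as follows. This is an illustration under stated assumptions rather than the analysis code: the function name `multigene_metrics`, the gene and cell identifiers, the use of raw UMI sums, and the choice of the sample standard deviation are all ours.

```python
import math


def multigene_metrics(counts, multigene_ids):
    """counts: {cell: {gene: UMI count}}. Returns {cell: (num_multigene, z_multigene)}."""
    cells = sorted(counts)
    # num_multigene: multigene family genes with more than zero UMIs in the cell.
    num = {c: sum(1 for g in multigene_ids if counts[c].get(g, 0) > 0) for c in cells}
    # Summed UMI counts over all multigene family genes in the cell.
    tot = {c: sum(counts[c].get(g, 0) for g in multigene_ids) for c in cells}
    mean = sum(tot.values()) / len(cells)
    # Sample standard deviation across cells (assumed; guarded for tiny inputs).
    sd = 0.0
    if len(cells) > 1:
        sd = math.sqrt(sum((v - mean) ** 2 for v in tot.values()) / (len(cells) - 1))
    # z_multigene: standardized summed expression across the dataset.
    return {c: (num[c], (tot[c] - mean) / sd if sd > 0 else 0.0) for c in cells}


# Hypothetical three-cell dataset with two multigene family genes.
counts = {
    "cell1": {"TcS_a": 4, "MASP_b": 0, "tubulin": 10},
    "cell2": {"TcS_a": 1, "MASP_b": 1, "tubulin": 8},
    "cell3": {"tubulin": 12},
}
metrics = multigene_metrics(counts, ["TcS_a", "MASP_b"])
```

      In this toy input, cell1 expresses few multigene family genes strongly (one gene, positive z-score), whereas cell2 expresses more of them at an average summed level.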

      (7) For the reclustering of trypomastigotes based on TcS genes alone, please show the UMAP and discuss why the resolution giving two clusters was chosen. I assume increasing the resolution does not reveal clusters of cells expressing one of the 8 groups of TcS, for example?

      We appreciate the reviewer’s suggestion. In this analysis, our goal was to test whether the phenotypic heterogeneity previously reported in trypomastigotes could be recapitulated using TcS genes alone, as prior studies described two major transcriptomic phenotypes within this stage.

      Increasing the clustering resolution did not reveal subclusters corresponding to the eight TcS sequence groups. This might reflect the fact that these groups are defined based on sequence similarity rather than on expression patterns, as noted by Freitas et al. (doi:10.1371/journal.pone.0025914).

      (8) In Figure 4B, there may be an upward trend in the level of expression and the number of cells a transcript is detected in? It would be worth showing this is or is not the case with statistics if possible.

      The number of genes detected in a high proportion of cells is low, which limits the statistical power of this analysis. Also, substantial dispersion is observed within the 0-5% interval. Nevertheless, this figure is presented primarily to highlight that a considerable number of highly expressed genes are detected in only a small fraction of cells. If expression level were the main determinant of detection frequency across cells, one would expect very few highly expressed genes to fall within the 0-5% interval. Contrary to this expectation, among the 50 highest expressed TcS genes, 62% are detected in fewer than 5% of cells, and even among the top 10 most highly expressed TcS genes, 40% fall within this lowest detection group. To facilitate this interpretation, we modified the figure (new Figure 4b) to explicitly highlight the top 50 most expressed TcS genes and incorporated this discussion into the main text of the revised manuscript (lines 244-251), making the conclusion clearer to the reader.
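
      The reasoning above amounts to ranking genes by expression and asking what share of the top-ranked genes falls in the lowest detection bin. A minimal sketch follows, with invented gene records and a hypothetical function name; how expression level is summarized per gene in Figure 4b is not specified here, so the `expression` field is a placeholder.

```python
def low_detection_fraction(genes, top_n, n_cells, threshold=0.05):
    """genes: list of (gene_id, expression, n_cells_detected) tuples.

    Returns the fraction of the top_n most expressed genes that are
    detected in fewer than `threshold` of all cells.
    """
    top = sorted(genes, key=lambda g: g[1], reverse=True)[:top_n]
    low = [g for g in top if g[2] / n_cells < threshold]
    return len(low) / len(top)


# Hypothetical records: (gene id, expression summary, cells with detection).
genes = [
    ("g1", 900, 40),
    ("g2", 850, 600),
    ("g3", 700, 30),
    ("g4", 650, 300),
    ("g5", 100, 10),
]
fraction = low_detection_fraction(genes, top_n=4, n_cells=1000)
```

      If expression level alone determined detection frequency, this fraction would be expected to be close to zero for the most highly expressed genes.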

      (9) Do the cells group instead by expression of any of the other multigene families not investigated in detail?

      It is possible that additional transcriptional substructure among trypomastigotes is driven by the expression of other multigene families beyond TcS. In this short report (with limited number of figures, words, etc.), we focused specifically on the trans-sialidase family as discussed earlier. A more comprehensive analysis including other large surface gene families (MASPs, mucins, GP63) is planned as part of ongoing work and will be presented in future reports.

      Reviewer #2 (Recommendations for the authors):

      This reviewer suggests conducting functional experiments in follow-up studies to establish links between TcS expression profiles and parasite behavior, and investigating the potential regulatory mechanisms responsible for the observed TcS heterogeneity, particularly focusing on epigenetic modifications. It would be interesting to correlate the highly expressed TcS members identified here with previously characterized TcS isoforms and provide more description regarding which particular groups and TcS members are driving the findings. The manuscript would benefit from further clarification regarding sequencing depth, the merging of technical replicates, subsampling, and specific parameters for alignment methods, and from more information regarding the specific statistical tests and their applicability to the data.

      This is a promising single-cell study with potentially high significance. The manuscript is well-written, and the analyses are reasonably well-executed. However, the current manuscript is limited by a lack of functional validation and mechanistic insights. The addition of further analyses and experiments, as suggested, will strengthen the conclusions and increase the impact of the work.

      We thank the reviewer for their careful reading of the manuscript. As suggested, we have performed additional validation and clarification of the results, as well as a more explicit discussion of their limitations. In addition, we have included a preliminary analysis exploring potential mechanisms that could be coordinating the observed expression patterns of the TcS family (see below). Even though we consider it relevant and interesting to experimentally validate these results, given the inherent difficulties in studying multigene families in T. cruzi, an organism with a very limited set of molecular biology tools (such as RNAi), further experimental validation of these observations is outside the scope of this short report.

      Regarding the reviewer’s question, we examined whether any TcS subgroup could be driving our observations. However, we did not find any correlations indicating that a particular group was associated with any of our findings. We now include TcS group information in Supplementary Table 3.

      Regarding technical details, we have now included the total number of mapped reads (line 422) and the average number of reads mapped per cell (new paragraph in the Methods section, lines 432-436).

      The technical replicates consist of a single Illumina library that was sequenced in two separate runs. As this approach is expected to be highly reproducible, we merged both runs into a single count table, as stated in line 424. To support this decision, we assessed the concordance between the two sequencing runs and observed an almost perfect correlation between them (Author response image 2).

      The subsampling method was originally described in the Figure 2 legend; to better highlight this approach, we have now moved its description to the Methods section (line 456).

      The specific kallisto parameters used are stated in the Methods (lines 418-419). We now also state that default options were used unless otherwise specified (lines 419-420).

      In response to the reviewer’s and editorial team’s request for additional mechanistic insight into the regulatory processes that may be involved in the observed patterns, we have expanded the revised manuscript to discuss how the genomic context of TcS loci could contribute to the observed heterogeneity in TcS expression. As noted in the original version of the manuscript, TcS genes and other surface-protein gene families are largely partitioned into discrete genomic compartments, whose expression has been reported to be regulated by epigenetic control of chromatin-folding domains (doi.org/10.1038/s41564-023-01483-y). However, we previously showed that TcS genes detected in a high proportion of cells are, in most cases, dispersed throughout the genome, arguing against a model in which their preferential expression results from colocalization within a small number of ubiquitously activated chromatin domains. In response to the reviewer’s suggestion, we performed a more detailed analysis of the genomic locations of these TcS genes. We found that many of them are localized within the core compartment (new Figure 5). Because the core compartment is enriched for conserved, housekeeping genes that typically display more constitutive expression (doi.org/10.1038/s41564-023-01483-y), whereas the disruptive compartment is enriched for lineage-specific multigene families associated with variable, stage-specific, and recently reported stochastic expression (doi.org/10.1038/s41467-025-64900-2), our results are consistent with a model in which compartment-specific regulatory mechanisms (in addition to post-transcriptional regulation) influence the differential cellular expression of core- versus disruptive-located TcS genes. We have incorporated these results and discussion in lines 301-313 of the revised manuscript.

      Reviewer #3 (Recommendations for the authors):

      (1) The authors consistently refer to gene "expression" but somewhere they should acknowledge that in trypanosomes RNA abundance is less predictive of protein than in most other organisms.

      We thank the reviewer for this important comment, highlighting a central challenge when studying trypanosomatid biology. We acknowledge that in most eukaryotes and particularly in T. cruzi, where there is a predominant role of post-transcriptional regulation, mRNA levels are not always directly correlated with protein abundance, as previously reported by us and others (10.1186/s12864-015-1563-8, 10.1128/msphere.00366-21, 10.1590/S0074-02762011000300002, 10.1042/bse0510031). Nevertheless, steady-state transcript levels obtained by RNA-seq remain informative for assessing differential gene expression, and this approach has been widely used as a proxy for the study of gene expression profiles in T. cruzi (10.7717/peerj.3017, 10.1371/journal.ppat.1005511, 10.1016/j.jbc.2023.104623, 10.3389/fcimb.2023.1138456, 10.1186/s13071-023-05775-4).

      It's also interesting to note that recent proteomic analyses (10.1038/s41467-025-64900-2) have revealed substantial heterogeneity in the expression of surface proteins, including trans-sialidases, supporting the idea that the transcriptional heterogeneity we observe reflects a genuine biological feature that propagates to the protein level.

      We have now added a sentence to the discussion acknowledging this limitation and discuss the results from Cruz-Saavedra, et al. in lines 266-271 of the revised manuscript.

      (2) Line 29, in the abstract there is a strong statement that T. cruzi "does not employ antigenic variation". I don't think there is much evidence either way if we are thinking about antigenic variation in the broad sense rather than the extreme model of T. brucei VSG switching. Later in the abstract they state that "no recurrent combinations of TcS genes were observed between individual cells in the population", which sounds very much like a form of antigenic variation.

      We agree with the reviewer. Indeed, we meant to state that T. cruzi does not employ an antigenic variation mechanism such as the one used by T. brucei. We have changed this statement as suggested in lines 28-32.

      (3) Line 29, "relies on a diverse array of cell-surface-associated proteins encoded by large multi-copy gene families (multigene families) essential for infectivity and immune evasion" and lines 55-58 "T. cruzi infection relies on a heterogeneous set of membrane proteins, encoded mainly by large multigene families ... most of which are involved in infection, tropism, and immune evasion". It would be worth adding a bit more detail on the nature and strength of the evidence that Tc "relies on" these various genes or that they are "essential" for infectivity, tropism, and immune evasion.

      Because the journal’s short format imposes word limits, we strengthened the original statement by adding specific references that document genomic, transcriptomic and functional evidence linking the major multigene families to infectivity, tropism and immune evasion (doi.org/10.1371/journal.pone.0025914; doi.org/10.1038/nrmicro1351; doi.org/10.1128/iai.05329-11; doi.org/10.1093/nar/gkp172, doi.org/10.1371/journal.ppat.1006767), in line 77.

      (4) Line 89, 1088 genes detected per cell - what is this as a % of genes in the genome?

      We detected a mean of 1088 genes per cell. Based on the 15,319 annotated protein-coding genes in the reference genome, this represents 7.1% of the T. cruzi protein-coding gene complement detected in each cell.

      Across the entire dataset, a total of 14,321 genes were detected in at least one cell, representing 93.5% of all annotated protein-coding genes. This suggests that our experiment captured a broad representation of the parasite's transcriptome.
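
      The two percentages quoted above follow directly from the reported counts. The short check below uses only the figures stated in this response; the variable names are illustrative.

```python
# Figures as stated in the response above.
annotated_genes = 15319      # annotated protein-coding genes in the reference
mean_genes_per_cell = 1088   # mean number of genes detected per cell
genes_detected_any = 14321   # genes detected in at least one cell

pct_per_cell = 100 * mean_genes_per_cell / annotated_genes
pct_any_cell = 100 * genes_detected_any / annotated_genes

print(f"Per-cell detection: {pct_per_cell:.1f}% of annotated genes")
print(f"Dataset-wide detection: {pct_any_cell:.1f}% of annotated genes")
```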

      This per-cell detection rate is characteristic of droplet-based scRNA-seq and is consistent with other trypanosomatid studies. For example, the T. brucei single-cell atlas (Hutchinson et al., 2021) reported a median detection of 1052 genes per cell. In the case of T. cruzi, the recently published pre-print of the T. cruzi single cell atlas from Laidlaw & García-Sánchez et al. reported a mean between 298 and 928 genes detected per cell (depending on the sample).

      This information is now included in Methods (line 435).

      (5) Line 93-94, how many cells were assigned to clusters 0 and 1?

      Cluster 0 had 2201 cells and cluster 1 had 824 cells assigned. We have now included these specific numbers in the new version of the manuscript (line 114).

      (6) Line 96, cluster 2 ama-trypo transitioning parasites - were these observable by microscopy?

      We did not perform microscopy specifically to observe or quantify the putative ama/trypo transitioning subpopulation; microscopy was only used as a pre-experiment quality check to verify cell morphology and viability. The inference that cluster 2 reflects ama/trypo transitioning parasites is drawn from the transcriptomic profile (particularly from the pattern of stage-associated marker expression observed in that cluster) and should be considered a data-generated hypothesis that merits further analysis, as stated in the manuscript.

      (7) Line 106-107, "As expected, single-copy gene expression is high in both amastigotes and trypomastigotes and similar on average between both cell types".

      (8) Why as expected? For a broad journal it would be useful to explain this. Amastigotes are replicative and trypomastigotes are not, so would we not expect to see some differences that reflect this?

      (9) What do you mean by the expression being "high"? High compared to what?

      (10) "Similar on average between both cell types". This does not seem concordant with Figure 1a showing a highly significant difference between ama and trypo.

      We thank the reviewer for this helpful request for clarification for broader readers and the observations regarding global expression of single copy and multigene family genes.

      Figure 2a is intended as an experimental control showing that our 10X Genomics data recapitulate the previously reported upregulation of surface protein genes in trypomastigotes. We have now modified the text to highlight this (line 129). In turn, Supplementary Figure 1a serves as a control showing that this upregulation is not a general feature of trypomastigote cells.

      Regarding comment 9, what we meant is that single-copy genes display relatively high expression in both amastigotes and trypomastigotes compared with surface protein-coding genes (see expression values in Figures 2a and Supplementary Figure 1a).

      Finally, differential expression between amastigotes and trypomastigotes at the transcriptomic level has been studied previously, and those studies showed that most single-copy genes do not vary between stages. This explains the overall pattern in Supplementary Figure 1a, where average expression is similar between stages (mean fold change = 1.1), and is likely due to the fact that these genes are related to basic cellular functions. Genes related to stage-specific functions, such as replication in amastigotes, or normalization effects may be causing the slight but statistically significant increase in overall expression observed in amastigotes. This contrasts with the pattern observed for multigene families, where there is clear overexpression in trypomastigotes (mean fold change = 1.5).

      As the observations raised in comments 9 and 10 have been described in previous studies and are neither novel nor key points of our results, we decided not to focus on them and modified the text accordingly in lines 129-135.

      (11) Line 110, "with high variation". What does "high variation" mean here? Compared to what? For the two metrics (n cells +ve for each gene and total expression level) can they give an average and the SD? It would be useful to know how many parasites the "average" surface (and core) gene is expressed in, or more precisely for which the RNA is above the limit of detection.

      We refer to the comparison with the expression profile observed for single-copy genes. This point has now been clarified in the text, and we have included the mean and standard deviation for both TcS multigene family genes and single-copy genes in trypomastigotes for both metrics in the Figure 2 legend. The average and distribution of the number of cells in which each gene is detected are shown in Figure 2c and Supplementary Figure 1a. We also added a reference to this panel at the point in the text where the phenomenon is first described.

      (12) Line 134, Figure 2b legend needs more detail - what are num_multigene and z_multigene?

      Please see our response to Reviewer 1, Question 6. We have now added a clarification to the legends of Figure 1 and Supplementary Figure 1.

      (13) Figure 2c, correct the y-axis legend because it implies your values are log10 transformed. Also, it would be useful to have more markers on the y axis so the reader can better estimate the data ranges.

      We thank the reviewer for this observation. We have now corrected the y-axis label and markers.

      (14) If the y-axis of Figure 2D started at 0 instead of 0.8 and if Lorenz curves were provided then the reader would probably get a fuller sense of the expression heterogeneity in the dataset. The legend states the differences are statistically significant but the actual p-values are not shown.

      (15) Line 142-3, more precision is needed on the p-values.

      We thank the reviewer for this helpful suggestion. We agree that Lorenz curves provide a clearer representation of expression heterogeneity than the previous plot. Accordingly, we have replaced the original panel (Figure 2d) with Lorenz curves for the groups under comparison, and have made the same change in Supplementary Figure 1d. In addition, we have included Gini index values and p-values for all comparisons in Supplementary Table 2.
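
      For readers unfamiliar with these measures, the sketch below shows one standard way to compute a Lorenz curve and the corresponding Gini index from a vector of non-negative expression values. It is a generic illustration, not the code used for Figure 2d or Supplementary Table 2; the function names and toy inputs are assumptions.

```python
import numpy as np

def lorenz_curve(values):
    """Return (x, y): cumulative share of items vs. cumulative share of the total.

    Values are sorted in ascending order; both arrays start at 0 and end at 1.
    Assumes non-negative values with a positive sum.
    """
    v = np.sort(np.asarray(values, dtype=float))
    x = np.arange(v.size + 1) / v.size
    y = np.concatenate([[0.0], np.cumsum(v) / v.sum()])
    return x, y

def gini_index(values):
    """Gini index as 1 minus twice the area under the Lorenz curve."""
    x, y = lorenz_curve(values)
    area = np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0  # trapezoid rule
    return 1.0 - 2.0 * area

# Toy examples: perfectly even expression vs. expression from a single gene.
even = [1, 1, 1, 1]
skewed = [0, 0, 0, 1]
print(gini_index(even), gini_index(skewed))
```

      A Gini index of 0 corresponds to perfectly even expression across items, and values approaching 1 indicate that a few items account for most of the total.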

      (16) Figure 3, as in Figure 1a it would be useful to add another UMAP plot to show the two trypo subpopulations.

      We thank the reviewer for this suggestion. We have now updated Figure 3 to include a UMAP plot showing the two trypomastigote subpopulations.

      (17) What is the observed proportion of broad vs slender trypomastigote morphologies for Dm28c? To be consistent with the speculation at line 162 then wouldn't it need to be approximately 50-50?

      The proportions of each trypomastigote subpopulation in the DM28c strain are currently unknown. The only available relevant data come from Brener, 1965 (doi.org/10.1080/00034983.1965.11686277), in which this strain was not included. In the strains analyzed in that study, the relative proportions of broad and slender trypomastigote morphologies were highly variable: across seven strains, broad forms ranged from 18.0% to 77.3%, while slender forms ranged from 2.3% to 71.6%. Given this wide variability and the lack of DM28c-specific data, we cannot assume any expected proportion for this strain.

      (18) Line 170, please state how many genes are in the TcS subgroup mentioned here. This is an interesting finding - does this include mostly catalytically active trans-sialidase genes or is it a mixture from across all the subfamilies?

      The TcS subgroup with a high frequency of detection comprises 31 genes, none of which belong to the catalytically active Group I trans-sialidases. Instead, this subgroup includes members of Groups II, III, IV, V, VI, and VIII. This information has been added to Supplementary Table 3 and is now stated in the revised manuscript (lines 227 - 228).

      (19) Line 175-176, "Gene dropouts might favor random patterns of gene family's detection in scRNA-seq experiments, particularly affecting genes with low expression" - I'm not sure if the authors mean the detection of a gene (or not) in an individual parasite is truly random (pure luck) or whether the term stochastic would be more appropriate because they seem to be referring to randomness around a certain threshold of RNA abundance/stability? They go on to rule this out, at least for TcS genes, essentially arguing that they have something resembling an ON or OFF pattern rather than a spectrum of expression levels. This is potentially very important and could advance the field in a major way, but the fact that so many core and ribosomal genes, which 'should' be always ON, cannot be detected in most cells is a concern. A version of Figure 4B for core and ribosomal genes could be informative - do they show a different pattern to TcS?

      Our results reveal a small subset of TcS genes that are frequently detected across cells, a pattern that is not compatible with random detection unless these genes were highly expressed and preferentially captured by random sampling. However, as shown in Figure 4b, many genes expressed at comparable levels are not detected at high frequencies. In line with this, Figure 4c shows that within individual cells, the detected TcS genes exhibit similar expression levels. Finally, we confirmed that this frequently detected subset shows high read counts at the bulk RNA-seq level (Supplementary Figure 2), consistent with the fact that these TcS are frequent in the population even when they are not especially highly expressed within each cell. Taken together, these findings argue against a purely random sampling of TcS genes and support the interpretation that this pattern reflects an underlying biological feature. We agree that further validation will be required. Accordingly, since the initial submission, we have been careful to frame our conclusions conservatively, explicitly noting that dropout remains a limitation of these data that could influence the observed patterns. In the revised version, we have strengthened this point by including a specific statement in the final remarks. Our interpretation is presented as a working hypothesis that is fully compatible with the observations reported here and may be informative for the field. To better reflect this reasoning, we have revised Figure 4b, expanded the discussion, and explicitly included this limitation in the final remarks of the revised manuscript.

      (20) Line 238-9, Add details of removing extracellular epimastigotes after cell infections.

      Only cellular trypomastigotes collected from the supernatant on day 6 were used for the secondary infection, at a 10:1 parasite-to-cell ratio. After 24 hours, the cultures were washed twice with PBS to remove any remaining extracellular parasites. Under these conditions, i.e. using exclusively trypomastigotes, at this infection ratio, and maintaining the cultures in mammalian medium, we do not expect the presence or survival of extracellular epimastigotes. We have included a sentence in the Methods section clarifying this information in the revised version of the manuscript, line 382.

      (21) Line 260, was methanol used to directly resuspend the parasite pellet, or was it resuspended first e.g. in a small volume of PBS?

      As described in lines 250-257 of the original manuscript, parasites were washed and resuspended in DPBS before methanol fixation. Methanol fixation was then carried out according to the 10X Genomics Methanol Fixation Protocol. We have now emphasized this more clearly in the revised text in line 400.

      (22) What was the doublet rate?

      We identified and removed 41 doublets, all belonging to cluster 2, and retained 3,151 singlets for downstream analysis (total cells before removal = 3,192). The resulting doublet rate was 1.28%. We have included a sentence in the Methods section clarifying this information in the revised version of the manuscript, lines 439-440.

      (23) What was the frequency of rRNA and kDNA-derived reads?

      Approximately 4.02% of the reads were derived from kDNA sequences, while 1.10% corresponded to rRNA-derived reads (Author response image 4).

      Author response image 4.

      Percentage of mitochondrial (kDNA) and ribosomal RNA (rRNA) derived reads.

    1. eLife Assessment

      This work of fundamental significance introduces a novel statistical model of spiking activity that incorporates continuous-time gain modulation. The authors provide exceptional evidence that the model outperforms earlier approaches and alternative candidates in capturing spiking responses across multiple visual areas in the macaque. Beyond its methodological contribution, the study offers new insights into how stimulus-driven variability and internally generated gain fluctuations evolve over time and between brain areas. The framework is likely to find broad application beyond the datasets examined here.

    2. Reviewer #1 (Public review):

      Summary:

      In this manuscript, Rupasinghe and co-authors introduce a new statistical model for spiking neurons. Building on earlier work, they propose to model spikes as arising from a Poisson process whereby the firing rate is the product of stimulus drive and a stimulus-independent gain signal. The critical innovation of this work is that the gain signal is modeled in continuous time. Earlier explorations of this statistical construction treated the gain-signal as constant within a trial. This innovation is elegant and important. It makes the model richer, more plausible, and more broadly applicable. The authors show that the model parameters are recoverable from realistic amounts of data and then apply the framework to previously studied datasets. They show that the new model outperforms earlier models and alternative candidates in capturing spiking data across four visual areas of the macaque monkey. Analysis of the model parameters replicates some earlier findings and uncovers several new insights. The model and fitting methods can be broadly applied to partition different types of signals and noise from spiking data and are likely to be widely adopted in the systems neuroscience community.
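
      The difference between a gain that is constant within a trial and one that fluctuates in continuous time can be illustrated with a small simulation. The sketch below is not the authors' model or code. It assumes a sinusoidal stimulus drive, a gamma-distributed per-trial gain for the constant case, and a log-normal AR(1) gain with a 50 ms correlation time for the fluctuating case; all of these choices, and all names, are hypothetical. For the constant-gain case, the spike-count variance should follow the known relation Var[N] = mu + s2 * mu^2, where mu is the mean count and s2 the gain variance.

```python
import numpy as np

def simulate_counts(drive, gain, dt, rng):
    """Total spike count per trial for rate(t) = drive(t) * gain(t).

    drive: array (n_bins,), stimulus drive in spikes/s
    gain:  array broadcastable to (n_trials, n_bins), unitless, mean 1
    """
    per_bin = rng.poisson(drive[None, :] * gain * dt)
    return per_bin.sum(axis=1)

rng = np.random.default_rng(0)
n_trials, n_bins, dt = 20000, 100, 0.01            # 1 s trials, 10 ms bins
t = np.arange(n_bins) * dt
drive = 20.0 + 10.0 * np.sin(2 * np.pi * 2.0 * t)  # hypothetical drive (spikes/s)
s2 = 0.25                                          # assumed gain variance

# (a) Gain constant within each trial: gamma with mean 1 and variance s2.
g_trial = rng.gamma(shape=1.0 / s2, scale=s2, size=(n_trials, 1))
counts_const = simulate_counts(drive, g_trial, dt, rng)

# (b) Gain fluctuating within each trial: log-normal AR(1) process with the
#     same marginal mean (1) and variance (s2) as in case (a).
v = np.log1p(s2)            # variance of log-gain that yields Var[gain] = s2
a = np.exp(-dt / 0.05)      # AR(1) coefficient for an assumed 50 ms time constant
x = np.empty((n_trials, n_bins))
x[:, 0] = np.sqrt(v) * rng.standard_normal(n_trials)
for k in range(1, n_bins):
    noise = rng.standard_normal(n_trials)
    x[:, k] = a * x[:, k - 1] + np.sqrt(v * (1 - a**2)) * noise
g_fast = np.exp(x - v / 2)  # subtracting v/2 keeps the mean gain at 1
counts_fast = simulate_counts(drive, g_fast, dt, rng)

mu = drive.sum() * dt                 # expected count per trial
predicted_var = mu + s2 * mu**2       # constant-gain prediction
fano_const = counts_const.var() / counts_const.mean()
fano_fast = counts_fast.var() / counts_fast.mean()
print(f"constant gain: var={counts_const.var():.1f}, predicted={predicted_var:.1f}")
print(f"Fano factor, constant gain: {fano_const:.2f}; fluctuating gain: {fano_fast:.2f}")
```

      In this toy setting the fast-fluctuating gain partly averages out over the trial, so the count-level overdispersion is smaller than in the constant-gain case even though the instantaneous gain variance is the same. This is one reason the within-trial time course of the gain matters when interpreting spike-count statistics.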

      Strengths:

      (1) Through clever use of advanced statistical techniques, the authors manage to infer critical information from single-trial single-cell data.

      (2) The question of which aspect of a spike train is signal and which is noise is omnipresent in neuroscience. By improving our ability to characterize the distinct factors that shape spiking activity, this work makes a fundamental contribution to the literature.

      Weaknesses:

      Overall, I find the work impressive and important. I have a couple of questions and suggestions.

      (1) The work is entirely focused on single-cell data. While this is a great starting point, expanding the approach to spiking activity in neural populations is an important future goal.

      (2) Line 49-53: These statements seem incorrect to me. The modulated Poisson model, as introduced in Goris et al (2014), is a process model that can perfectly well be used to generate spike trains (within a trial, spiking emerges from a Poisson process, which can be homogeneous or inhomogeneous). Moreover, the model contains a parameter that represents the duration of the counting window (delta t). The dependency of over-dispersion on the size of the time bins for real neurons is shown in Figure 1b (inset plot) of that paper (and shown to resemble the model prediction). This time-dependency was further explored by the same authors in Goris et al (2018 - Journal of Vision) and also in Hénaff et al (2020 - Nature Communications). I suggest that the authors rephrase this argument (here and at some later points in the paper). They could just say that the Goris model makes the simplistic and implausible assumption that, within a given trial, gain does not fluctuate. This is clearly an important limitation and the key difference with the continuous model introduced here.

      (3) Line 54-55: I think the first part of the claim is a bit misleading. There is nothing in the Goris model that would inherently limit it to homogeneous Poisson processes, as seems to be implied by this description. The model is built on the assumption that spike generation within a trial arises from a Poisson process. This may very well be an inhomogeneous Poisson process (i.e., a stimulus-dependent time-varying firing rate). Homogeneous and inhomogeneous Poisson processes both give rise to Poisson distributed spike counts (and thus a mixture of Poisson distributions across trials in the Goris model). I suggest the authors clarify this description a bit. Note that the two model variants illustrated in Figure 1b and c were also explored in Hénaff et al (2020 - Nature Communications).

      (4) The extension to the continuous case is very elegant!

      (5) I find the result shown in Appendix 3 critically important. The recoverability of the model for realistic amounts of data is foundational for the rest of the paper. I would consider including this analysis in the main results section. Not all readers may check Appendix 3, but they should know about this result.

      (6) Figure 3: I am wondering whether the inferred gain is capturing some response fluctuations that originate from the cell's phase-selectivity. Could the authors compute the trial-averaged inferred gain (ideally, aligned to stimulus-phase at the start of the trial if this experimental parameter varied across repeats)? If they have successfully partitioned the response variance, the trial-averaged gain should have no systematic temporal structure. If it has a sinusoidal modulation, it may partially capture stimulus-drive. This could be an interesting test to run on all model fits to further validate that the partitioning into a signal and noise component succeeded as intended.

      (7) One common observation that is currently not explored is the quenching of neuronal response variability following stimulus onset (Churchland et al 2010 - Nature Neuroscience), which was suggested to reflect a quenching of gain variability in Goris et al (2024 - Nature Reviews Neuroscience). Building on the previous suggestion, the authors could compute the temporal evolution of cross-trial gain variability from the inferred gain traces. Do they recognize a reduction in gain variability following stimulus onset? If so, it would be worthwhile to show this.

      (8) Line 543-565: I want to make sure I understand the Baseline Poisson model and Poisson-GP correctly. For the baseline model, I had imagined that the authors would simply use the stimulus-conditioned PSTH as an estimate of the time-dependent firing rate, coupled with an inhomogeneous Poisson process assumption. But they additionally assume a Gamma prior on the firing rate to compensate for the sparseness of the data (sometimes only 5 repeats per condition). The Poisson-GP includes exactly the same model components, but now the time-dependent firing rate is modeled by a Gaussian process. Doing this massively improves the goodness-of-fit (Fig 4A). Do I understand this correctly?

    3. Reviewer #2 (Public review):

      Summary:

      Neurons have varied responses to external stimuli that cannot be explained by naive Poisson models. Previous work has quantified and partitioned higher-than-Poisson variability in the brain into different components. The authors improve on these methods to infer how both the stimulus drive and internal gain dynamics impact neuronal variability continuously in time. The clean and well-reasoned model is rigorously developed and then applied to neural data across the visual hierarchy. This lends new insights into how variability is partitioned, agreeing with and extending previous work on how that variability changes from early visual areas (LGN, V1) through to higher, motion-sensitive areas (area MT). Another key contribution is that this partitioning can be fully addressed as a continuous-time process, which allows for the dissection of how the timescale of fluctuations in these two components changes across the brain's processing arc.

      Strengths:

      (1) The model is cleanly derived and thoroughly documented, including usable code shared in a GitHub repo. This makes the method immediately portable to other neural systems.

      (2) This is a clear and well-presented piece of work. The figures and writing are clear and understandable, and all pieces of the derivations are included in the main text and supplementary information.

      (3) Comparisons to other models, particularly the one from Goris et al., 2014, show how this Continuous Modulated Poisson (CMP) model outperforms previous work.

      (4) New insights about how variability partitioning changes across the visual stream from LGN to MT are revealed, including how the gain fluctuates on longer timescales in higher visual areas. Another key result about the anticorrelation between the variance in stimulus drive and gain fluctuations comports with theories about how neurons maintain efficient, reliable encoding.

      (5) In addition to the results reported here, this work will serve as an excellent tutorial for students and postdocs first delving into the sources of variability in the brain.

      Weaknesses:

      The work is somewhat incremental, building on previous studies of the partitioning of variability in the brain, but it provides important new extensions, as noted above.

      The only major gap I would suggest addressing in the Discussion is the observation of sub-Poisson variability in the brain. It seems clear that this model can extend to sub-Poisson variability and its partitioning and perhaps even show how that varies in real time, with an animal's attentional state. That is, of course, beyond the scope of the current work, but could be mentioned in the Discussion.

    4. Author response:

      Reviewer #1 (Public review):

      We thank the reviewer for the thoughtful and detailed evaluation of our manuscript. We are pleased that the continuous-time formulation and its methodological contributions were viewed as elegant and broadly applicable, and that the empirical analyses provide meaningful new insights into neural variability across the visual hierarchy. We appreciate the reviewer’s constructive suggestions and clarifications, which will help us improve the precision, clarity, and scope of the manuscript. Below we respond to each point in turn and outline the revisions we will make.

      (1) Extension to neural populations: We thank the reviewer for this important suggestion. We agree that extending the framework to population recordings is a natural next step. In this work, we focus on single-cell data to establish the model and validate inference. In the revised manuscript, we will expand the Discussion to outline how the framework could be generalized to population activity, for example by incorporating shared latent-variable structure.

      (2) Clarification regarding the Modulated Poisson model: We thank the reviewer for pointing this out. We agree that our description was not sufficiently precise and may have been unclear. The modulated Poisson model introduced in Goris et al. (2014) is indeed a generative process model that can be used to generate spike trains, and we apologize for the inaccurate characterization of this framework. Our intended point was that the original formulation assumes gain is constant within a trial (or counting window) and does not provide a principled mechanism for modeling continuously time-varying gain fluctuations within trials. In the revised manuscript, we will clarify this distinction and revise the relevant passages accordingly. We will also cite and discuss related extensions and analyses in Goris et al. (2018) and Hénaff et al. (2020) to provide a more accurate and complete characterization of prior work.

      (3) Continuous extensions of the Goris model: We thank the reviewer for this helpful clarification. We agree that the Goris model is not limited to homogeneous Poisson spiking and can incorporate a stimulus-dependent, time-varying firing rate within trials. We did not intend to imply otherwise, and we will revise the relevant text to avoid this misunderstanding. Our intended point was that, in formulating continuous-time extensions, we explicitly model the time-varying stimulus drive using a GP prior, as in the CMP framework, and then consider different assumptions about the temporal structure of the gain process, including constant and finely sampled gain. This highlights the distinction between piecewise-constant gain assumptions and the fully continuous gain process introduced in our model. We will clarify this distinction in the revised manuscript. We will also acknowledge related variants explored in Hénaff et al. (2020) and more clearly describe how our formulation differs, including the role of smoothness priors on the stimulus drive and gain processes.

      (4) Continuous-time extension: We thank the reviewer for the positive comment and are pleased that the continuous-time formulation was viewed as elegant.

      (5) Parameter recovery analysis: We thank the reviewer for emphasizing the importance of this result. We agree that demonstrating parameter recoverability is foundational to the paper. In the revised manuscript, we will move the Appendix 3 analysis into the main Results section and clearly illustrate how our inference procedure faithfully recovers the generative parameters in simulation studies.

      (6) Validation of gain–stimulus separation: We thank the reviewer for this insightful suggestion. We agree that verifying that the inferred gain does not capture stimulus-driven structure is an important validation of the model. In the revised manuscript, we will compute the trial-averaged inferred gain, to assess whether it exhibits systematic temporal structure. This analysis will provide an additional check that the partitioning between stimulus drive and gain fluctuations operates as intended.

      (7) Temporal evolution of gain variability: We thank the reviewer for this valuable suggestion. We agree that examining whether gain variability decreases following stimulus onset is an important and relevant analysis. In the revised manuscript, we will compute the temporal evolution of cross-trial gain variability from the inferred gain traces and assess whether a quenching effect is observed after stimulus onset. If present, we will report and illustrate this result.
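
      As a rough illustration of the two checks described in points (6) and (7), the sketch below computes the across-trial mean and variance of gain traces in each time bin, then compares variance before and after stimulus onset. It is not the analysis code of the study: the input layout (one list of gain values per trial on a shared time grid), the function names, and the simple post/pre variance ratio are all assumptions made for the example.

```python
import statistics


def across_trial_profile(gain_traces):
    """Per-time-bin mean and sample variance of gain across trials.

    gain_traces: list of trials, each a list of gain values on a shared
    time grid (hypothetical layout; at least two trials are required).
    """
    means, variances = [], []
    for bin_values in zip(*gain_traces):
        means.append(statistics.fmean(bin_values))
        variances.append(statistics.variance(bin_values))
    return means, variances


def quenching_ratio(variances, onset_bin):
    """Mean post-onset variance divided by mean pre-onset variance.

    A ratio below 1 would be consistent with variability quenching.
    """
    pre = statistics.fmean(variances[:onset_bin])
    post = statistics.fmean(variances[onset_bin:])
    return post / pre


# Toy data: three trials, four time bins, stimulus onset at bin 2.
traces = [
    [0.5, 0.5, 0.9, 0.9],
    [1.5, 1.5, 1.1, 1.1],
    [1.0, 1.0, 1.0, 1.0],
]
mean_gain, gain_variance = across_trial_profile(traces)
ratio = quenching_ratio(gain_variance, onset_bin=2)
```

      A flat trial-averaged gain (`mean_gain`) would suggest that stimulus-locked structure has not leaked into the gain, while `ratio` summarizes whether cross-trial variability drops after onset.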

      (8) Clarification of Baseline Poisson and Poisson-GP models: We thank the reviewer for this careful reading. Yes, this understanding is correct. The Baseline Poisson model uses a stimulus-conditioned PSTH as an estimate of the time-dependent firing rate and includes a Gamma prior to regularize rate estimates in conditions with sparse repeats. The Poisson-GP model retains the same structure but models the time-dependent firing rate using a stimulus-specific Gaussian process prior, which substantially improves goodness-of-fit. In the revised manuscript, we will clarify this description. We will also highlight that Figure 4 – figure supplement 2 illustrates how introducing a GP smoothness prior on the stimulus drive markedly improves model fit, even within the Goris-style model.

      Reviewer 2 (Public review):

      We thank the reviewer for the thoughtful and positive assessment of our work. We are pleased that the model development, empirical analyses, and presentation were found to be clear and rigorous. We appreciate the recognition that the continuous-time formulation meaningfully extends prior variability-partitioning approaches and enables a more precise characterization of how stimulus drive and internal gain dynamics evolve across temporal scales. We are also encouraged that the cross-area analyses and model comparisons were viewed as providing new insights and clear empirical improvements. Below, we address the specific suggestions raised by the reviewer.

      Positioning relative to prior work: Regarding the comment on incremental contribution, we agree that our framework builds directly on earlier variability-partitioning approaches. Our goal was to extend these models to continuous time and to develop a principled inference framework capable of characterizing how gain dynamics evolve across temporal scales. We will further clarify this positioning in the revised manuscript.

      Extension to sub-Poisson variability: We thank the reviewer for this suggestion. We agree that sub-Poisson variability is an important phenomenon observed in neural data. Because the CMP model builds on a Poisson observation model with stochastic gain modulation, it naturally captures Poisson and super-Poisson variability but cannot generate sub-Poisson spike count statistics in its existing form. We will clarify this limitation in the revised manuscript and expand the Discussion to outline potential extensions that could address sub-Poisson variability, such as incorporating spike-history effects, renewal-process models, or alternative count distributions.
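
      The limitation follows from the law of total variance. If counts are Poisson with rate g·mu given a gain g that has mean 1, then E[N] = mu and Var[N] = mu + mu^2 · Var[g], so the Fano factor is 1 + mu · Var[g], which can never fall below 1. The simulation below is only a sketch of this point, not the CMP model itself: the per-trial Gamma-distributed gain, the function names, and the parameter values are assumptions chosen for illustration.

```python
import math
import random
import statistics


def sample_poisson(rate, rng):
    """Draw one Poisson count with Knuth's multiplication method (small rates only)."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def fano_factor(counts):
    """Variance-to-mean ratio of spike counts across trials."""
    return statistics.pvariance(counts) / statistics.fmean(counts)


def simulate_counts(mean_rate, gain_variance, n_trials, seed=0):
    """Spike counts from a Poisson model whose rate is scaled by a per-trial gain.

    The gain is Gamma-distributed with mean 1 and variance `gain_variance`;
    gain_variance == 0 reduces to a plain Poisson model.
    """
    rng = random.Random(seed)
    counts = []
    for _ in range(n_trials):
        if gain_variance > 0:
            shape = 1.0 / gain_variance  # shape * scale = 1, so the mean gain is 1
            gain = rng.gammavariate(shape, gain_variance)
        else:
            gain = 1.0
        counts.append(sample_poisson(gain * mean_rate, rng))
    return counts


plain = fano_factor(simulate_counts(5.0, 0.0, 20000))   # close to 1
gained = fano_factor(simulate_counts(5.0, 0.25, 20000))  # well above 1
```

      Whatever gain variance is chosen, the simulated Fano factor stays at or above the Poisson value, which is why extensions such as spike-history terms or alternative count distributions are needed to reach the sub-Poisson regime.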

    1. eLife Assessment

      This valuable study demonstrates how individual taste preferences shift over time, how these changes relate to cortical activity, and how experience reshapes both. The evidence is largely solid, although additional analyses are needed to strengthen some of the conclusions. The results should be of interest to neuroscientists studying sensory physiology.

    2. Reviewer #1 (Public review):

      Summary:

      Maigler et al. set out to test the hypothesis that individual differences in taste preferences are (in part) due to individual differences in central taste processing. They first tested rats' preferences for a variety of taste stimuli on multiple days. They then recorded responses of neurons in the taste cortex to the same tastes on two consecutive days.

      Strengths:

      The authors collected high-resolution behavioral data from the same animals across multiple days, allowing for a detailed characterization of individual variation in taste preferences. They then performed recordings from the same set of animals in response to the same stimuli, allowing them to draw parallels between behavioral and neural responses. They convincingly show that preference ranks for a variety of basic tastes change over time and that the correlation between neural responses and preferences is not stable, correlating more strongly with more recent measures of preference.

      Weaknesses:

      Behavioral analysis: Data presentation does not show how preferences change over the course of testing. In particular, it is unclear whether there are any systematic changes in preferences over the course of testing that could explain the observed changes in correlation with neural responses, such as changes due to learning (e.g., flavor nutrient conditioning, relief of neophobia), changes in deprivation state, or habituation to/proficiency with the BAT setup. A secondary point is whether any changes in preference are attributed to internal individual versus external contextual factors. Both types of variation (i.e., across individuals and across time within an individual) are mentioned in the introduction, but it is not clear what the authors believe about the nature or neural representation of these sources of variation.

      With respect to neural data analysis, no individual animal/day data are shown, making it difficult to assess the extent to which differences in correlation match individual differences in preferences and/or changes in preference with time within individuals. The correlation analysis is also lacking control for the fact that there is a certain degree of "chance" associated with behavioral and neural measures having matching ranks.

      Finally, the conclusion that correlations between final day preferences and neural responses obtained from the second recording session are the result of experience needs more justification; it is unclear to what extent changes in correlation may be attributed to overall changes in responsiveness of the neural population.

    3. Reviewer #2 (Public review):

      Summary:

      The study from Maigler et al. investigates how between- and within-animal differences in taste preference relate to differences in neural responsiveness. The experiments rely on an elegant combination of behavioral assays to measure preference (e.g., repeated brief access testing, BAT) and electrophysiological recordings to monitor the activity of ensembles of neurons in the gustatory cortex (GC) of rats.

      BAT with distinct batteries of tastants revealed pronounced variability in preference (measured as licking bout size) across individuals. This variability across individuals persisted after repeated testing. Repeated BAT also revealed that each rat's preference for different tastants changed across time.

      Electrophysiological responses of GC neurons to batteries of tastants showed that firing in the "late epoch" of taste processing (i.e., 500 ms post taste delivery) correlated more strongly with each individual rat's BAT preference than with a canonical preference ranking. Importantly, this correlation was stronger for the last BAT session compared to the first. Finally, the authors show that the correlation disappeared in a second, consecutive recording session, indicating that exposure to tastants reconfigures preferences.

      Strengths:

      (1) The experimental design allows for an unprecedented look at the relationship between individual variability in taste preferences and neural processing.

      (2) The study demonstrates that taste preference variability is not mere experimental noise but reflects the dynamic nature of taste. A key strength is the clear evidence that behavioral variability is reflected in neural activity patterns, establishing a strong correlation between brain and behavior.

      (3) The evidence that simple exposure to familiar tastes can reconfigure preferences and taste representations is interesting.

      Weaknesses:

      (1) The manuscript could use additional corollary analyses to provide a more complete picture of the phenomenon. For instance, how many neurons (per animal and in total) have significant correlations with the final BAT patterns? And with the first BAT? Can a time course of such counts be provided? Can some decoding analyses be performed at a single session level to reconstruct a rat's behavioral preference pattern from its neural activity?

      (2) The manuscript could benefit from additional polishing, both in the text as well as in the figures.

    4. Reviewer #3 (Public review):

      Summary:

      Maigler & Lin et al. present a compelling set of behavioral and electrophysiological experiments exploring how individual differences in taste preference map onto neural responses in the gustatory cortex (GC). They go on to examine how both preferences and neural responses shift following intervening taste experience. Their experiments are strengthened by examining tastes of distinct identities and palatability (sweet, sour, salty, bitter) and relating each animal's individual preference to the palatability-related late phase of the neural response.

      Strengths:

      (1) They demonstrate a relationship between the behavioral expression of taste preference and palatability-related GC neural responses. The direct correlation of expression of taste preference with GC neural responses indicates that taste preference behavior may be less noisy than previously thought, reflecting actual neural activity.

      (2) They address the stability of individual taste preference by comparing within and between session expression. This finding indicates that individual preference on any given test session can differ from canonical palatability.

      (3) They provide evidence that representational drift in palatability coding may arise from sensory experience rather than from the passive passage of time. The findings are novel and potentially impactful. The results are relatively complete.

      Weaknesses:

      Experiments require further clarification. The interpretations would be strengthened by reorganizing the experimental design.

      (1) Figures 5-6 show shifts in palatability-related GC responses from recording day 1 to recording day 2. The authors propose that this drift is due to the taste experience during recording day 1, but the study, as designed, does not directly test this idea. Without a behavioral measure collected after recording day 1 intraoral exposure, it is not possible to determine whether taste preference was altered by that experience, nor whether the neural responses collected on recording day 2 represent current or most recent palatability expression vs something else. The authors' conclusion would be strengthened by adding an intervening brief access test between recording days 1 and 2. The authors could then determine whether the behavioral preferences changed after intraoral taste exposure on recording day 1, as well as whether the new set of taste-related palatability responses correlates with the updated taste preferences.

      (2) The current experimental design exposes animals to 3 distinct sets of substances. These substances differ in identity (some rats never experienced sweet, while others did not experience bitter during the recording sessions) and concentration (ranging from very aversive to slightly aversive or possibly even neutral). Because palatability is known to be comparative depending on the other substances available and concentration-dependent, this introduces challenges to interpretation.

      The authors state that "no differences in effects were observed between taste batteries" (Methods), but it is not clear which analyses were performed to determine the lack of difference, especially considering that many of the analyses are within-animal. Without more clarity, it is difficult to evaluate whether the interaction of different tastes within the sets of stimuli biases the main conclusions.

      (3) Responses to sweet tastes are not reported in the electrophysiology data. This is seemingly the case because rats given set 1 received no sweet stimulus while rats given set 2 received 2 distinct sweet tastes. Finally, rats given set 3 did not receive quinine, yet quinine is reported in electrophysiology data.

      (4) The choice of reporting average lick cluster size is problematic because the authors use thirsty rats with 10-second-long trials. Thirsty rats are likely to lick in relatively long clusters, especially for neutral and palatable tastes. If the rat is mid-cluster when the trial ends, the final cluster would be cut off prematurely, resulting in shorter overall average lick cluster size, disproportionately affecting neutral and palatable tastes over aversive tastes.

      (5) Canonical palatability rankings may not apply to the concentrations selected in every stimulus set. This is particularly true for set 1, which included two concentrations of citric acid and quinine for the behavior. It is also not clear which concentrations are reported in Figures 3A2 and 3B2. Meanwhile, the concentrations of quinine and citric acid used for electrophysiology are quite low.

    5. Author response:

      Public Reviews:

      Reviewer #1 (Public review):

      …It is unclear whether there are any systematic changes in preferences over the course of testing that could explain the observed changes in correlation with neural responses, such as changes due to learning (e.g., flavor nutrient conditioning, relief of neophobia), changes in deprivation state, or habituation to/proficiency with the BAT setup.

      For the revision, we will add analysis (either as additional panels for Figure 3 or as a new Figure between what are now Figures 3 & 4) testing the hypothesis that preference changes across testing days are non-random. Concretely, we will test: 1) whether the preference for palatable tastes increases with experience (a result that would make sense given research on neophobia); 2) whether the preference for aversive tastes decreases with experience; and 3) whether absolute consumption of any particular taste changes in a reliable direction from session to session.

      A secondary point is whether any changes in preference are attributed to internal individual versus external contextual factors. Both types of variation (i.e., across individuals and across time within an individual) are mentioned in the introduction, but it is not clear what the authors believe about the nature or neural representation of these sources of variation.

      While we assume that differences between rats are due to internal factors (given the controlled home-cage environment), we cannot rule out that some subtle, subthreshold (for us as observers) factor impacts taste preferences. Similarly, while changes across time within an individual are categorically within the individual, we cannot be sure whether some subtle facet of their experiences determines how preferences change (as opposed to it being purely internal). We will add prose to the Discussion section on this topic, including citation of Hilary Schiff’s recent work showing nurture-related preference changes.

      With respect to neural data analysis, no individual animal/day data are shown, making it difficult to assess the extent to which differences in correlation match individual differences in preferences and/or changes in preference with time within individuals.

      The revision will include Figure panels (with analysis) showing the relationships between individual neural responses and consumption in the first and last BAT tests for 1-2 representative rats.

      The correlation analysis is also lacking control for the fact that there is a certain degree of "chance" associated with behavioral and neural measures having matching ranks.

      Certainly chance cannot explain our results, which consist mainly of within-rat differences in match (i.e., specific enhancement of that match for the most recent behavioral assessment)—a finding that is all the more surprising given that: 1) 2 weeks separate that behavior test and the electrophysiology session; and that 2) that 2-week gap is only 1-3 days less than the gap using the first behavioral test (that reliably correlates less well with the neural data). Nonetheless, we will add an independent, convergent analysis to the revision, testing whether the observed pattern vanishes when we shuffle the preference ranks in the behavioral data—if the result is based on chance, this shuffling should have no impact on the neural-behavioral match.
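
      The shuffle control can be pictured with the minimal sketch below, which compares an observed Spearman rank correlation between neural responses and behavioral preference against the distribution obtained after shuffling the preference values. It is an illustration only, not the analysis planned for the revision: the per-taste input vectors, the function names, and the one-sided test are assumptions.

```python
import random


def ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            result[order[position]] = average_rank
        start = end + 1
    return result


def spearman(x, y):
    """Spearman rank correlation (Pearson correlation of the ranks)."""
    rx, ry = ranks(x), ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def shuffle_p_value(neural, preference, n_shuffles=2000, seed=0):
    """One-sided permutation p-value for the neural-behavioral rank match."""
    rng = random.Random(seed)
    observed = spearman(neural, preference)
    shuffled = list(preference)
    at_least_as_large = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        if spearman(neural, shuffled) >= observed:
            at_least_as_large += 1
    # Add-one correction keeps the estimate away from an impossible p = 0.
    return observed, (at_least_as_large + 1) / (n_shuffles + 1)
```

      With only a handful of tastes per battery the number of distinct rank orderings is small (24 for four tastes), so a per-neuron shuffle test has coarse resolution; pooling across neurons or animals would be one way around this.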

      Finally, …it is unclear to what extent changes in correlation may be attributed to overall changes in responsiveness of the neural population.

      We will include a new analysis in the revision testing the hypothesis that the reduction in match between the neural and behavioral rankings reflects changes in neural excitability—spontaneous and taste-driven—between the first and second electrophysiology sessions.

      Reviewer #2 (Public review):

      The manuscript could use additional corollary analyses to provide a more complete picture of the phenomenon. For instance, how many neurons (per animal and in total) have significant correlations with the final BAT patterns? And with the first BAT? Can a time course of such counts be provided? Can some decoding analyses be performed at a single session level to reconstruct a rat's behavioral preference pattern from its neural activity?

      These are all really good ideas. We are in the process of implementing all but the last; we will attempt the last as well, but can’t promise that we have large enough ensembles to provide stable results of such a subtle decoding task (reflecting the last BAT session’s preference pattern significantly better than the first session’s pattern).

      The manuscript could benefit from additional polishing, both in the text as well as in the figures.

      This polishing is under way, on the basis of suggestions made by Reviewer 2 in the non-public comments.

      Reviewer #3 (Public review):

      Without a behavioral measure collected after recording day 1 intraoral exposure, it is not possible to determine whether taste preference was altered by that experience…The authors' conclusion would be strengthened by adding an intervening brief access test between recording days 1 and 2.

      We very much appreciate Reviewer 3’s suggestion, but the primary authors involved in data collection on this project have moved on, and we won’t be able to collect the additional dataset that would be required. Instead, we will soften the conclusion that we reach in the last section, and suggest this experiment as a future direction.

      The current experimental design exposes animals to 3 distinct sets of substances … [that] differ in identity … and concentration. Because palatability is known to be comparative depending on the other substances available and concentration-dependent, this introduces challenges to interpretation, [and] without more clarity, it is difficult to evaluate whether the interaction of different tastes within the sets of stimuli biases the main conclusions.

      This is an interesting point. We hope that some of the work that we are undertaking in response to Reviewers 1 & 2 (see above) will shed light on whether there is any non-randomness in between-session preference changes; such non-randomness would imply that we might want to conclude that preferences change more with one battery than another. But we will perform a more direct test of this hypothesis, breaking the dataset apart and asking whether our phenomena are observed more with one battery than another. If it turns out that the magnitude of the impact of experience does depend on the nature of the taste battery (we predict not, for reasons that are in the manuscript), we shall introduce that complexity into our interpretation, and the Discussion thereof.

      Responses to sweet tastes are not reported in the electrophysiology data. This is seemingly the case because rats given set 1 received no sweet stimulus while rats given set 2 received 2 distinct sweet tastes. Finally, rats given set 3 did not receive quinine, yet quinine is reported in electrophysiology data.

      We are unsure of the source of this confusion—in every case, the rat received the same tastes in the electrophysiology sessions that were delivered in the BAT preference tests—but we will modify the text to ensure: 1) that panels reflecting data from a single rat (panels that will therefore necessarily include only a subset of possible tastes) are clearly marked as such; and 2) that the nature of which taste batteries were delivered is more explicit.

      The choice of reporting average lick cluster size is problematic because the authors use thirsty rats with 10-second-long trials. Thirsty rats are likely to lick in relatively long clusters, especially for neutral and palatable tastes. If the rat is mid-cluster when the trial ends, the final cluster would be cut off prematurely, resulting in shorter overall average lick cluster size, disproportionately affecting neutral and palatable tastes over aversive tastes.

      We have ourselves been deeply concerned with this issue; we have recently published a paper that includes within it a direct test demonstrating that calculations of lick bout lengths from 10-sec BAT trials result in taste palatability estimates that are identical to (and less noisy than) those generated from more classically-used 15-min ad lib licking. We will cite this paper (Lin, et al., 2026) in the Methods section of the revision, along with text clarifying how we calculated lick clusters. That said, we are also planning to conduct an additional analysis that estimates taste preference after removing these “premature bouts” and will evaluate how this recalculation affects our results.

      Of course, even if 10-sec BAT trial data DIDN’T provide reliable preference measures, the result of clusters being cut short by the end of a trial would be an underestimation of the preference for the palatable tastes (which drive far more licking than aversive tastes and are therefore more likely to be mid-bout at the end of a trial). Such an underestimation would in turn be expected to reduce the observed neural-behavioral correlation. This fact actually highlights the robustness of our findings.
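
      To make the truncation concern concrete, the sketch below groups lick timestamps from one trial into clusters and recomputes the mean cluster size after dropping a final cluster that was still in progress when the trial ended. It is a hedged illustration rather than our analysis code: the 0.5 s pause criterion, the function names, and the rule for calling a cluster truncated are assumptions made for the example.

```python
def lick_clusters(lick_times, max_gap=0.5):
    """Split sorted lick timestamps (s) into clusters separated by pauses > max_gap."""
    clusters = []
    for t in lick_times:
        if clusters and t - clusters[-1][-1] <= max_gap:
            clusters[-1].append(t)
        else:
            clusters.append([t])
    return clusters


def mean_cluster_size(lick_times, trial_end=10.0, max_gap=0.5, drop_truncated=False):
    """Mean licks per cluster; optionally drop a final cluster cut off by trial end.

    A final cluster counts as truncated when its last lick falls within
    max_gap of trial_end, so the pause that would end it was never observed.
    """
    clusters = lick_clusters(lick_times, max_gap)
    if drop_truncated and clusters and trial_end - clusters[-1][-1] <= max_gap:
        clusters = clusters[:-1]
    if not clusters:
        return None
    return sum(len(c) for c in clusters) / len(clusters)


# Toy trial: clusters of 3 and 2 licks, then a 4-lick cluster still running at 10 s.
licks = [0.1, 0.25, 0.4, 2.0, 2.15, 9.5, 9.65, 9.8, 9.95]
with_all = mean_cluster_size(licks)
without_truncated = mean_cluster_size(licks, drop_truncated=True)
```

      Comparing the two estimates taste by taste indicates how much trial-end truncation contributes to the preference measure.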

      Canonical palatability rankings may not apply to the concentrations selected in every stimulus set. This is particularly true for set 1, which included two concentrations of citric acid and quinine for the behavior. It is also not clear which concentrations are reported in Figures 3A2 and 3B2. Meanwhile, the concentrations of quinine and citric acid used for electrophysiology are quite low.

      In the Methods section of the revision, we will explicitly motivate our reasoning behind canonical rankings for each taste battery used (the added text will include citations). We have also added prose to the Discussion section concerning the possible impact of getting those rankings wrong: that impact is minimal, given that our results are largely driven by differences between rats (and day-to-day differences within rat), and that almost any choice of canonical rankings would therefore poorly reflect the behavior of individual rats on individual days.

    1. eLife Assessment

      This study presents valuable insights into cellular sites of monoamine production and presence in Pristionchus pacificus, providing a comparative reference for the detailed knowledge of C. elegans, as well as using this information to compare serotonergic anatomy in 22 nematode species. Functional assays support evolved differences in monoaminergic control over certain, but not all, tested behaviors. The evidence is convincing, combining careful genetic experiments and comparative analysis that are well aligned with the conclusions. The results will serve as a basis for (comparative) structural-functional studies of nematode behavior.

    2. Reviewer #1 (Public review):

      Summary:

      The authors provide extensive immunoreactivity and expression data to map monoaminergic neurotransmitter production sites in Pristionchus pacificus. This nematode is relatively distantly related to the popular model nematode Caenorhabditis elegans, for which such information is already available. They find that dopamine, tyramine, and octopamine are present in the same neurons in both species, but differences are observed for serotonin. This forms the basis for a comparison of serotonergic neurons across 22 nematode species. In addition, they evaluate monoaminergic effects on egg-laying, head movement during reversals, and nictation behavior, to find that monoaminergic control over the latter differs between C. elegans and P. pacificus. This shows that some anatomical flexibility supports similar outcomes, whereas in other cases it is the basis of evolved regulatory differences.

      Strengths:

      The comparative efforts are laudable and valuable, including a thorough revisiting of old data and corrections of what is judged as a historic misannotation. The expected continued value of this work is also appreciated: because nematodes have similar anatomies and behaviors, cellular-resolution data from different species permit the study of functional evolution of neurotransmitter usage in homologous neurons.

      Despite the strong experimental approach, there are some points that require addressing:

      (1) Not all the concepts of the introduction ('feeding behaviors', to a lesser extent also 'evolution of neurotransmitter usage in homologous neurons') are followed up upon in the results or discussion sections.

      (2) The choice of nematodes ('only' 13 species) may affect what is perceived as ancestral. Also, identifying their cells based on comparisons with Ce or Ppa identifications only is understandable but mildly risky: there are many cells in the head, and mistakes would go unnoticed until detailed analysis in each species can provide conclusive evidence.

      (3) It is not reported whether the nictation-defective mutants have general locomotion defects; therefore, whether the reported problem is specific to this host-finding behavior or not.

      (4) The section on RIP neurons makes sense for Ppa, but not for Ce (dauers in fact have weakened IL2-to-RIP connections), and should be revised. The nictation data also do not support the breadth of the conclusions, which should either be toned down or rephrased as hypothetical.

      (5) The discussion mostly reiterates the results, leaving little room for the authors' interpretations and opinions. I would suggest reworking in favor of conceptual discussion.

    3. Reviewer #2 (Public review):

      Summary:

      This paper makes important contributions to our understanding of how nervous systems evolve, with a particular focus on whether changes in neurotransmitter usage within homologous neurons represent a mechanism for evolutionary adaptation without large-scale changes to circuitry. Comparing the predatory nematode P. pacificus with C. elegans, this study systematically examines monoamine-producing neurons, assesses how their neurotransmitter identities differ between homologous neural types, and determines how these differences relate to behavior.

      Strengths:

      The major strength of this work is its breadth, rigor, and data quality. It combines multiple, independent lines of evidence to assign neurotransmitter identity for neurons with homology grounded in lineage, morphology, and connectomics, which is essential for meaningful cross-species comparisons. Additionally, by extending the analysis beyond P. pacificus and C. elegans to other nematodes, the authors convincingly argue that features observed in P. pacificus likely reflect an ancestral state. This depth greatly enhances the significance of the conclusions.

      This work is likely to have a significant impact on the fields of comparative neurobiology and nervous system evolution. It demonstrates a powerful system and approach for linking molecular identity, cell-type homology, circuit context, and behavior across species. The data generated here will be a valuable resource for the community and provide a strong foundation for future mechanistic studies.

      More broadly, the study reinforces the idea that evolutionary change in nervous systems can occur through modulation of chemical signaling within conserved circuits, rather than through complete rewiring. This conceptual framework is likely to influence how researchers think about neural evolution in other systems.

      Weaknesses:

      Given the availability of detailed connectivity information for both species, a more explicit comparison of the local circuit context of key neurons would further strengthen the link between molecular identity and circuit function.

    4. Reviewer #3 (Public review):

      Summary:

      The study by Hong, Loer, Hobert, and colleagues is a comprehensive description of monoaminergic neurons in the nematode Pristionchus pacificus. The work used multiple, complementary approaches, including immunostaining and expression of genes involved in neurotransmitter synthesis or transport, to identify neurons that express a monoamine neurotransmitter. Moreover, this study characterized the phenotypes of various mutants to study their organismal function. Extensive comparisons are made to C. elegans, the nematode model that, in a way, anchors the model studied here, and new outgroup species were examined for some features so that the polarity of their evolution could be inferred. Although there is no simple or groundbreaking punchline to distill from the manuscript (i.e., other than some things are the same as in C. elegans, and some things are different), and while the study is basically descriptive in nature, the scope of the project warrants broad attention.

      Strengths:

      This manuscript offers a tremendous resource for those who use this species as a model, which, based on the author list alone, includes many labs. This study sets the bar for what can be done in a "satellite" model system.

      Given the complementarity of approaches used, such as the position of cell bodies, the connectivity and morphology of dendrites, and a previously published atlas of the connectome for this species, the identification of specific neurons (which, as the authors point out, can be easily mistaken) is convincing throughout. Likewise, appropriate caution is observed where neuron identities are ambiguous, e.g., unlabelled cells in Figure 5, or ambiguous identities in other species, as shown in Figure 10. There was a lot of data to unpack in this manuscript, but I could not find any obvious flaws in neuron identification.

      Also, the phenotypic assays were straightforward and informative.

      Weaknesses:

      No serious weaknesses were noted. One minor comment is that in general, I think the Methods could use some additional text to describe what the goal of any given technique was. For example, although there is a description of the HCR protocol in the methods, nowhere does it say what genes this method would be used for. In addition to what is shown in Figure 4, this information should be given in the Methods.

    5. Author response:

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      The authors provide extensive immunoreactivity and expression data to map monoaminergic neurotransmitter production sites in Pristionchus pacificus. This nematode is relatively distantly related to the popular model nematode Caenorhabditis elegans, for which such information is already available. They find that dopamine, tyramine, and octopamine are present in the same neurons in both species, but differences are observed for serotonin. This forms the basis for a comparison of serotonergic neurons across 22 nematode species. In addition, they evaluate monoaminergic effects on egg-laying, head movement during reversals, and nictation behavior, to find that monoaminergic control over the latter differs between C. elegans and P. pacificus. This shows that some anatomical flexibility supports similar outcomes, whereas in other cases it is the basis of evolved regulatory differences.

      Strengths:

      The comparative efforts are laudable and valuable, including a thorough revisiting of old data and corrections of what is judged as a historic misannotation. The expected continued value of this work is also appreciated: because nematodes have similar anatomies and behaviors, cellular-resolution data from different species permit the study of the functional evolution of neurotransmitter usage in homologous neurons.

      Despite the strong experimental approach, there are some points that require addressing:

      (1) Not all the concepts of the introduction ('feeding behaviors', to a lesser extent also 'evolution of neurotransmitter usage in homologous neurons') are followed up upon in the results or discussion sections.

      We will address the relative treatment of particular topics in the introduction and discussion in a revised version of the article.

      (2) The choice of nematodes ('only' 13 species) may affect what is perceived as ancestral.

      See above regarding ‘13 species’ (actually 22). Most species and genera were specifically selected previously (Loer and Rivard, 2007; Rivard et al., 2010) for broad phylogenetic coverage, representing different species and genera in 4 major clades within ‘clade V’ (Kiontke et al., 2007; Sudhaus, 2011): Anarhabditis (Caenorhabditis, including both the Elegans and Drosophilae species groups), Synrhabditis (Oscheius, Metarhabditis, Reiterina and Rhabditella), Pleiorhabditis (Teratorhabditis, Mesorhabditis, Rhomborhabditis and Pelodera), and Diplogastrids represented by P. pacificus. Among the outgroups to clade V, there are 3 distinct clades represented, each with at least two species and/or genera represented. Therefore, we believe that the determination of an ancestral condition is well-founded. We plan to add this rationale to the revised version to make this clearer.

      (2, continued) Also, identifying their cells based on comparisons with Ce or Ppa identifications only is understandable but mildly risky: there are many cells in the head, and mistakes would go unnoticed until detailed analysis in each species can provide conclusive evidence.

      We agree that there is a mild risk of incorrect identification but believe that appropriate caveats are noted in the text. Furthermore, the recent head EM reconstruction and complete embryonic cell lineage of P. pacificus (Cook et al., 2025) show a nearly 1:1 homology correspondence between head neurons (e.g., only a single head neuron is missing in the Ppa head relative to Cel due to altered apoptosis), and the high level of conservation of neurite morphology and soma position between Cel and Ppa suggests that identifications are likely correct when examining related nematodes. In cases where a serotonin-immunoreactive cell is found in the predicted location (often with apparent associated neurites), its homology to the matching Cel and Ppa cell is the most parsimonious interpretation: otherwise, one cell would have to lose expression and another nearby cell gain it.

      (3) It is not reported whether the nictation-defective mutants have general locomotion defects; it is therefore unclear whether the reported problem is specific to this host-finding behavior.

      None of the mutants we tested for nictation behavior, including those that show severe defects in nictation (Ppa-cat-1, Ppa-tph-1, Ppa-tdc-1, Ppa-tbh-1), exhibited noticeable general locomotion defects either as dauers or non-dauers. Further clarification will be provided in a revised version of the article.

      (4) The section on RIP neurons makes sense for Ppa, but not for Ce (dauers in fact have weakened IL2-to-RIP connections) and should be revised. The nictation data also do not support the breadth of the conclusions, which should either be toned down or rephrased as hypothetical.

      We plan to address these concerns in a revised version of the article.

      (5) The discussion mostly reiterates the results, leaving little room for the authors' interpretations and opinions. I would suggest reworking in favor of conceptual discussion.

      As noted above, we agree to address the relative treatment of matters in discussion in a revised version of the article.

      Reviewer #2 (Public review):

      Summary:

      This paper makes important contributions to our understanding of how nervous systems evolve, with a particular focus on whether changes in neurotransmitter usage within homologous neurons represent a mechanism for evolutionary adaptation without large-scale changes to circuitry. Comparing the predatory nematode P. pacificus with C. elegans, this study systematically examines monoamine-producing neurons, assesses how their neurotransmitter identities differ between homologous neural types, and determines how these differences relate to behavior.

      Strengths:

      The major strength of this work is its breadth, rigor, and data quality. It combines multiple, independent lines of evidence to assign neurotransmitter identity for neurons with homology grounded in lineage, morphology, and connectomics, which is essential for meaningful cross-species comparisons. Additionally, by extending the analysis beyond P. pacificus and C. elegans to other nematodes, the authors convincingly argue that features observed in P. pacificus likely reflect an ancestral state. This depth greatly enhances the significance of the conclusions.

      This work is likely to have a significant impact on the fields of comparative neurobiology and nervous system evolution. It demonstrates a powerful system and approach for linking molecular identity, cell-type homology, circuit context, and behavior across species. The data generated here will be a valuable resource for the community and provide a strong foundation for future mechanistic studies.

      More broadly, the study reinforces the idea that evolutionary change in nervous systems can occur through modulation of chemical signaling within conserved circuits, rather than through complete rewiring. This conceptual framework is likely to influence how researchers think about neural evolution in other systems.

      Weaknesses:

      Given the availability of detailed connectivity information for both species, a more explicit comparison of the local circuit context of key neurons would further strengthen the link between molecular identity and circuit function.

      We plan to address these concerns in a revised version of the article.

      Reviewer #3 (Public review):

      Summary:

      The study by Hong, Loer, Hobert, and colleagues is a comprehensive description of monoaminergic neurons in the nematode Pristionchus pacificus. The work used multiple, complementary approaches, including immunostaining and expression of genes involved in neurotransmitter synthesis or transport, to identify neurons that express a monoamine neurotransmitter. Moreover, this study characterized the phenotypes of various mutants to study their organismal function. Extensive comparisons are made to C. elegans, the nematode model that, in a way, anchors the model studied here, and new outgroup species were examined for some features so that the polarity of their evolution could be inferred. Although there is no simple or groundbreaking punchline to distill from the manuscript (i.e., other than some things are the same as in C. elegans, and some things are different), and while the study is basically descriptive in nature, the scope of the project warrants broad attention.

      Strengths:

      This manuscript offers a tremendous resource for those who use this species as a model, which, based on the author list alone, includes many labs. This study sets the bar for what can be done in a "satellite" model system.

      Given the complementarity of approaches used, such as the position of cell bodies, the connectivity and morphology of dendrites, and a previously published atlas of the connectome for this species, the identification of specific neurons (which, as the authors point out, can be easily mistaken) is convincing throughout. Likewise, appropriate caution is observed where neuron identities are ambiguous, e.g., unlabeled cells in Figure 5, or ambiguous identities in other species, as shown in Figure 10. There was a lot of data to unpack in this manuscript, but I could not find any obvious flaws in neuron identification.

      Also, the phenotypic assays were straightforward and informative.

      Weaknesses:

      No serious weaknesses were noted. One minor comment is that in general, I think the Methods could use some additional text to describe what the goal of any given technique was. For example, although there is a description of the HCR protocol in the methods, nowhere does it say what genes this method would be used for. In addition to what is shown in Figure 4, this information should be given in the Methods.

      More detailed methods will be provided in a revised version of the article.

    1. eLife Assessment

      This study presents valuable findings on how retrieval practice protects memory inferences from stress via covert memory reactivation. Via two EEG experiments manipulating stress and retrieval practice, the authors provide solid evidence supporting the conclusion. This work will be of interest to cognitive and affective neuroscientists working on the intersection between memory and stress.

    2. Reviewer #1 (Public review):

      Summary:

      This manuscript examines whether retrieval practice protects memory-based inference from acute stress and proposes rapid neural reactivation of a bridging memory element as the underlying mechanism. Using a two-day associative inference paradigm combined with EEG decoding, the authors report that stress impairs inference accuracy and speed, while retrieval practice eliminates these deficits and restores neural signatures associated with bridge-element reactivation. The study addresses an important and timely question by integrating research on retrieval-based learning, stress effects on memory, and neural dynamics of inference. While the work provides promising multi-level evidence linking behavioral and neural findings, limitations in experimental design, causal interpretation, and decoding specificity weaken the strength of the mechanistic claims and suggest that further work is needed to disentangle strengthened associative memory from inference-specific protection effects.

      Strengths:

      (1) Strong theoretical integration

      The study integrates three influential frameworks: memory integration through associative inference, stress-induced retrieval impairment, and the testing effect. The authors present a clear theoretical narrative linking these domains and derive testable hypotheses that retrieval practice protects inference by strengthening neural reactivation of a bridge element. The conceptual framing is well-grounded in prior literature and addresses an important gap regarding neural dynamics during inference.

      (2) Multi-level evidence

      The study provides converging behavioral and neural evidence. The authors demonstrate that stress reduces inference accuracy and speed, while retrieval practice eliminates these deficits. EEG decoding further suggests that bridge element reactivation predicts successful inference. The combination of behavioral performance and neural decoding strengthens the overall argument.

      (3) Transparent experimental implementation

      The procedures are described in substantial detail, including stimulus construction, stress manipulation, and decoding pipelines. Data and code availability are also strengths, facilitating reproducibility.

      Weaknesses:

      (1) Insufficient evidence that retrieval practice specifically protects inference rather than strengthening associative memories

      A central claim of the manuscript is that retrieval practice specifically protects inference ability rather than simply strengthening underlying associative memories. However, the current data do not convincingly distinguish between these possibilities. Although the authors limited analyses to trials in which AB and BC pairs were correctly retrieved in the subsequent memory test, this procedure does not fully rule out the possibility that improved inference performance reflects stronger base associative memories rather than enhanced integrative processes.

      Importantly, the direct memory retrieval test used a two-alternative forced-choice (2AFC) format, which inherently allows a substantial proportion of correct responses to arise from guessing. Consequently, trials classified as "successfully retrieved" may still include weak associative memory traces, making it difficult to conclude that failures in inference reflect deficits in integration rather than incomplete associative learning.
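
      The scale of the guessing concern can be illustrated with a minimal sketch. It assumes a simple high-threshold guessing model (a participant either knows the answer or guesses uniformly among the alternatives); this model, the function name, and the example accuracy are assumptions made here for illustration and do not come from the manuscript.

```python
def guess_share_2afc(p_correct: float, n_alternatives: int = 2) -> float:
    """Share of correct responses attributable to guessing.

    Assumes a high-threshold model (an illustrative assumption): the
    participant knows the answer with probability k, and otherwise
    guesses uniformly among the alternatives.
    """
    g = 1.0 / n_alternatives  # probability that a pure guess is correct
    if not g <= p_correct <= 1.0:
        raise ValueError("accuracy must lie between chance and 1")
    # Observed accuracy: p_correct = k + (1 - k) * g, solved for k.
    k = (p_correct - g) / (1.0 - g)
    # Fraction of all correct responses that were lucky guesses.
    return (1.0 - k) * g / p_correct


# Hypothetical accuracy value, chosen only for illustration.
share = guess_share_2afc(0.80)
print(f"{share:.3f} of correct 2AFC responses would be guesses")
```

      Under this assumed model, an observed 2AFC accuracy of 80% would mean that about a quarter of the "successfully retrieved" responses were guesses, which is why such trials cannot be taken as evidence of strong premise memory.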

      The authors further argue that retrieval practice does not improve inference in the absence of stress, suggesting independence between inference and associative memory strength. However, this null effect does not sufficiently rule out mediation through strengthened premise memory. A factorial design and/or mediation analysis would be necessary to determine whether inference resilience emerges independently of premise memory strength.

      (2) Apparent below-chance inference performance raises interpretational concerns

      One surprising aspect of the results is that inference performance across experiments and groups appears to fall below the theoretical chance level (0.33) in Figure 4A. This is particularly unexpected because analyses were restricted to trials in which participants correctly retrieved both AB and BC associations.

      If performance is indeed below chance, this raises concerns regarding whether participants fully understood the task instructions or whether other methodological factors influenced performance. Additionally, below-chance performance complicates the interpretation of subsequent behavioral and neural analyses. It is possible that this reflects my misunderstanding of the figure; therefore, clarification from the authors regarding how inference accuracy is calculated and presented would be helpful.

      (3) Between-experiment implementation of retrieval practice weakens causal inference

      The retrieval practice manipulation was implemented as a separate experiment rather than as part of a factorial design. Experiment 2 was conducted after results from Experiment 1 were known, and the authors acknowledge this post hoc decision. This design introduces several potential confounds, including cohort differences between experiments, possible differences in participant motivation or task familiarity, and reduced ability to rigorously test interaction effects.

      Although the authors combined data across experiments to test interactions between stress and retrieval practice, such post hoc aggregation cannot fully substitute for a factorial design. A within-experiment 2 × 2 design (Stress × Retrieval Practice) would provide substantially stronger causal evidence and reduce confounding influences.

      (4) Lack of an appropriate comparison condition for retrieval practice limits the interpretation of the mechanism

      Although acknowledged briefly in the discussion, the absence of an appropriate comparison condition for retrieval practice represents a critical limitation. Without a matched re-exposure or restudy control condition, it remains unclear whether observed benefits are attributable specifically to retrieval practice or to additional exposure to AB and BC associations.

      Furthermore, it is unclear whether retrieval practice operates at the trial level or the participant level. Retrieval practice could enhance memory representations for specific practiced items, making those trials more resistant to stress, or it could induce a more global change in cognitive strategy or stress resilience across participants. One way to address this issue would be to analyze inference performance separately for trials that were successfully retrieved during the retrieval practice phase versus those that were not.

      (5) Interpretation of EEG decoding as bridge-element reactivation may be overstated

      The neural decoding results form the mechanistic foundation of the manuscript; however, the interpretation that decoding reflects reactivation of specific bridging memories may be overstated. The classifier distinguishes between face and building categories, and because the bridging element belongs to one of these categories, successful decoding may reflect category-level semantic activation rather than reinstatement of item-specific episodic representations.

      Alternative explanations include category-level retrieval, strategic task differences, or even attentional biases. Because only two categories were used, the decoding analysis lacks the specificity necessary to distinguish between category-level and item-level reactivation. As such, conclusions regarding the reinstatement of specific bridging memories should be tempered or supported with additional analyses.

    3. Reviewer #2 (Public review):

      Summary:

      Guo et al. investigate the neural and behavioral mechanisms of stress-induced impairments in memory-based inference. Across two well-powered experiments (N=136), the authors demonstrate that acute stress disrupts the rapid neural reactivation of "bridge" elements necessary for novel inferences. Crucially, they identify retrieval practice as a robust behavioral buffer that restores both inferential performance and the underlying neural signatures of memory reactivation.

      Strengths:

      (1) The use of two independent experiments provides high confidence in the behavioral findings.

      (2) Utilizing time-resolved EEG decoding allows the authors to pinpoint the "online" moment of inferential failure, a significant advancement over the lower temporal resolution of fMRI.

      Weaknesses:

      (1) The authors correctly timed the inference task to begin approximately 20 minutes after the onset of the stressor. While this window aligns with the expected peak of the glucocorticoid (HPA) response, it also represents a period where the rapid adrenergic (SAM) response, confirmed by heart rate elevation, is still highly influential. As the authors acknowledge, because they did not collect saliva samples due to safety protocols, they cannot definitively separate the influence of peak cortisol from the tail-end of the adrenergic surge on the observed memory impairments.

      (2) Figures 4 and 6: Without asterisks, it is really difficult to compare the significant group differences.

      Appraisal and Impact:

      This study provides high-quality evidence that acute stress impairs the rapid neural reactivation of "bridge" elements necessary for novel memory-based inferences. By leveraging the high temporal resolution of EEG decoding, the authors identify the specific neural "chokepoint" where inferential failure occurs. The research is strengthened by two independent experiments and the identification of retrieval practice as a powerful buffer that not only preserves but also enhances neural reactivation under pressure. The findings have significant implications for both cognitive neuroscience and applied learning science.

    4. Reviewer #3 (Public review):

      Summary:

      In this study, Guo and colleagues investigated the effects of stress and retrieval practice on memory inference. In the first experiment, they found that memory inference was significantly worse after induced stress. Conversely, when participants received retrieval practice in the second experiment, they found no significant differences between these conditions. They monitored EEG during the inference phase and applied multivariate decoding analysis to examine evidence of neural reactivation. Complementing the behavioural findings of the first experiment, they found that they were able to decode the stimulus category of the inference item with more fidelity in the no-stress condition. Surprisingly, they found the opposite pattern when participants had retrieval practice, with stronger evidence of reactivation in the stress condition than in the control condition.

      Strengths:

      (1) The authors have carefully designed two studies investigating the effects of stress and memory retrieval on memory inference.

      (2) The use of multivariate decoding on the inference phase data sheds new light on how stress and retrieval may impact the neural signatures of inference processing.

      Weaknesses:

      (1) There are some key gaps in the reporting of the data. In particular, data is missing on how many trials were included in the inference phase and how many were retrieved in the direct memory task. This is important to know as the main conclusions are based on inference trials proportional to the direct retrieval trials. Considering that the direct retrieval performance differs significantly between the experiments, there could be issues with floor/ceiling effects (in the behaviour) and statistical power (in the EEG results) that confound the comparisons between experiments. Without the data, it is difficult to draw conclusions.

      (2) There are some relatively strong conclusions drawn without the data to support them. An important example is the title suggesting a mechanistic role of memory reactivation for these effects; however, the data instead suggest a relationship between successful inference and evidence of reactivation. Additionally, one-tailed t-tests have been used in follow-up tests, and, as I understand it, no multiple comparisons corrections have been applied to the post-hoc tests, suggesting that these findings should be interpreted with caution.
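
      The statistical point can be made concrete with a minimal sketch of what two-tailed testing plus a multiple-comparisons correction would do to a family of follow-up tests. The Holm step-down procedure is used here as one common choice (the review does not name a specific correction), and the p-values are hypothetical, not taken from the manuscript.

```python
def holm_adjust(p_values):
    """Holm step-down adjusted p-values (controls the family-wise error rate)."""
    m = len(p_values)
    # Indices of the p-values, from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Adjusted p-values must not decrease along the sorted order.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


def two_tailed_from_one_tailed(p_one_tailed):
    """Two-tailed p for a symmetric test statistic whose observed effect
    lies in the predicted direction."""
    return min(1.0, 2.0 * p_one_tailed)


# Hypothetical one-tailed p-values for three follow-up tests.
one_tailed = [0.010, 0.030, 0.040]
two_tailed = [two_tailed_from_one_tailed(p) for p in one_tailed]
print(holm_adjust(two_tailed))
```

      In this made-up family, all three tests fall below 0.05 when one-tailed and uncorrected, but none does after doubling and Holm adjustment, which illustrates why the reported follow-up tests call for cautious interpretation.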

      (3) In places, the structure is unclear, making the narrative difficult to follow, often making it necessary for the reader to go back and forth between the sections to understand the study and analyses. I have made some recommendations for how to improve this.

    5. Author response:

      Public reviews:

      Reviewer #1 (Public review):

      (1) We agree that the current design does not allow us to cleanly dissociate whether the beneficial effect of retrieval practice on AC inference under stress reflects a selective enhancement of inferential processing or, instead, stronger memory for the underlying AB and BC premise pairs that supports later inference. We plan to revise the manuscript to remove wording that could be read as claiming that retrieval practice specifically protects inference independently of associative-memory strengthening.

      Our intended interpretation is more modest. As shown in Section 3.2.3, retrieval practice improved direct premise-memory performance, consistent with the well-established testing effect. In the present paradigm, successful AC inference necessarily depends on access to the AB and BC premise associations. Accordingly, strengthened premise memory is not an alternative explanation that can be excluded by our data, but rather a plausible mechanism through which retrieval practice may promote more resilient inference performance under stress.

      Because AC inference in our paradigm necessarily depends on retrieving and linking the AB and BC premise pairs, strengthened premise memory is not merely a competing explanation that can be separated from inference performance in the current dataset. Rather, it is a plausible mechanism through which retrieval practice may support inference, especially under stress. We will therefore revise the manuscript to avoid implying that retrieval practice protects inferential processing independently of associative-memory strengthening, and will instead interpret the effect more conservatively as reflecting enhanced premise representations and/or more effective reactivation of bridge information during inference.

      We also agree that the post-inference direct memory test, which used a 2AFC format, provides only a coarse measure of premise-memory strength and allows some proportion of correct responses to arise from guessing. Therefore, restricting analyses to trials in which AB and BC were later answered correctly does not fully guarantee that those trials were supported by strong associative memories. We will acknowledge this limitation explicitly in the manuscript and temper our interpretation of these “successfully retrieved” premise trials accordingly. More stringent measures, such as cued recall, confidence-based memory judgments, or other continuous indices of premise-memory strength, would be better suited to this question in future work.

      Finally, we agree that the absence of a retrieval-practice benefit in the non-stress condition does not by itself rule out mediation through strengthened premise memory. Because the retrieval-practice manipulation was introduced in a follow-up study after completion of Study 1, the present dataset was not designed as a single fully crossed factorial experiment. In response to the reviewer’s suggestion, we will add an exploratory mediation analysis testing whether premise-memory performance statistically accounts for the relationship between retrieval practice and inference performance. We will report this analysis cautiously, given that premise memory was assessed using a post-inference 2AFC measure, and we will note in the manuscript that a future fully crossed design with more sensitive premise-memory measures will be needed for a stronger test.

      (2) We apologize that the presentation of Figure 4A was not sufficiently clear and may have created the impression of below-chance inference performance. The values shown in Figure 4A do not represent raw 3-alternative forced-choice (3AFC) A-C inference accuracy, for which the theoretical chance level would be 0.33. Instead, Figure 4A plots a normalized inference index, calculated as inference performance relative to direct retrieval performance, to account for individual differences in the availability of the directly learned premise pairs. Therefore, the raw 3AFC chance level is not the appropriate reference for interpreting this measure. To avoid this confusion, we will clarify in the revised manuscript and figure legend that Figure 4A shows a normalized inference index rather than raw inference accuracy.

      (3) We agree that implementing retrieval practice in a separate experiment, rather than within a single 2 × 2 factorial design, limits the strength of the causal inference regarding retrieval practice and reduces our ability to formally test the retrieval practice × stress interaction within one unified design.

      In response, we will revise the manuscript to more explicitly acknowledge this limitation and to temper our interpretation throughout. Specifically, we will avoid overstating retrieval practice as definitively preventing the effects of stress, and will instead describe the findings more cautiously as evidence that retrieval practice was associated with attenuation of stress-related inference impairments across experiments. We will also add a limitation statement in the Discussion noting that the current design cannot fully rule out cohort-related confounds and that a fully crossed factorial design will be necessary in future work to provide a more rigorous test of the interaction between retrieval practice and stress.

      At the same time, we will clarify that the two experiments were conducted under closely matched conditions: participants were recruited using the same protocol from the same campus population, demographic characteristics were matched, and both experiments were run in the same laboratory using the same EEG system, task procedures, and experimenter team. We agree, however, that these procedural consistencies reduce but do not eliminate the concern about between-experiment confounds.

      (4) We agree that the absence of a matched re-exposure/restudy control condition limits the mechanistic interpretation of the retrieval-practice effect. In the revised manuscript, we will make this limitation more explicit in the Discussion and temper our conclusions accordingly. Specifically, we will clarify that the present design shows that a post-encoding retrieval-practice intervention buffered the impact of acute stress on later inference, but does not allow us to determine whether this benefit is specific to retrieval practice per se, rather than to additional exposure to the AB and BC associations.

      We also agree that it is important to distinguish whether the effect operates at the level of specific practiced items or reflects a more global participant-level effect. In the current study, however, the retrieval-practice phase in Experiment 2 was implemented as a brief timed free-recall procedure rather than a trial-by-trial cued retrieval task, and the available records do not allow us to reliably link retrieval-practice success for individual associations to specific later AC inference trials. Therefore, we cannot directly compare later inference performance for successfully versus unsuccessfully retrieved items on a trial-by-trial basis.

      To address this issue as far as possible with the current dataset, we instead plan to conduct an additional item-level robustness analysis using mixed-effects models that account for variability across ABC associations. Specifically, we will test whether the critical stress-by-retrieval-practice effect remains after modeling triad-level variability, and whether there is evidence that this effect differs substantially across triads. This analysis does not provide a direct test of whether successfully retrieved items benefit more than unsuccessfully retrieved items, but it does help assess whether the observed effect is broadly distributed across associations or driven by only a small subset of items.

      (5) We agree that our current decoding approach does not justify a strong claim of item-specific reinstatement of a unique bridge memory. The classifier was trained to discriminate stimulus categories (faces vs. buildings) in the independent localizer and then applied during the inference phase. Therefore, the present analysis is better interpreted as indexing reactivation of bridge-related category information, rather than reinstatement of an item-specific episodic representation.

      Importantly, however, we believe this signal remains theoretically informative for the inferential process examined here. In our design, the bridge element B belonged to one of the trained categories, and the classifier was applied during the cue period when no face or building stimulus was physically present. Thus, successful decoding in this time window suggests that task-relevant bridge-related information was re-expressed online during inference, rather than reflecting concurrent perceptual processing. At the same time, we agree that, because only two categories were used, the decoding analysis cannot fully dissociate bridge-related category reactivation from broader category-level retrieval, strategic task differences, or attentional contributions.

      To address this concern, we plan to revise the manuscript in three ways. First, we will soften the interpretation throughout the Results and Discussion to avoid claims of item-specific bridge-memory reinstatement. Second, we will refer to the decoding effect more conservatively as bridge-related or category-level mnemonic reactivation during inference. Third, we will add an explicit limitation stating that the current design does not allow us to distinguish item-specific episodic reinstatement from category-level reactivation, and that future work using more fine-grained representational analyses and/or a larger stimulus set will be needed to resolve this issue more directly.

      Reviewer #2 (Public review):

      (1) We agree with this important point. The inference task was scheduled to begin approximately 20 minutes after stress onset based on prior human stress literature, with the intention of probing a time window commonly associated with glucocorticoid effects. However, as the reviewer notes, this period may also still reflect residual adrenergic/SAM influences. Because salivary cortisol was not collected due to the COVID-19-related safety protocol, we cannot disentangle the relative contributions of glucocorticoid and adrenergic responses to the observed stress-related effects on inference and neural reactivation. We will revise the manuscript to make this limitation more explicit in the Discussion and to avoid attributing the effects to a specific physiological component of the stress response.

      (2) In the revised manuscript, we will add asterisks (or equivalent significance annotations) to Figures 4 and 6 to improve clarity and readability.

      Reviewer #3 (Public review):

      (1) We thank the reviewer for highlighting this important reporting issue. We agree that the number of trials contributing to the behavioral and EEG analyses should be reported more explicitly, particularly because inference performance was analyzed in relation to direct retrieval performance and because direct retrieval differed across experiments.

      In the revised manuscript, we will report, for each group and experiment, the number of trials presented in the AC inference phase, the number of trials retained for the behavioral analyses, and the number of successfully retrieved direct-memory trials in the AB and BC tasks. These values will be summarized in the revised Results section and in Supplementary Tables.

      To directly address the reviewer’s concern, we will also compare trial counts across groups/experiments and evaluate whether differences in direct retrieval performance could account for the inference and EEG effects. To further address the concern about potentially unequal trial numbers, we plan to repeat the analyses on trial-count-matched subsets to test whether the results remain qualitatively unchanged.

      (2) We thank the reviewer for this important comment. We agree that our original title and some parts of the manuscript used language that was stronger than warranted by the data. Our results show that rapid reactivation of the bridge element is associated with successful inference and is modulated by stress and retrieval practice, but they do not by themselves establish a causal mechanistic role for reactivation. We therefore plan to revise the title and soften the relevant wording throughout the manuscript to better reflect the correlational nature of this evidence.

      Specifically, we plan to change the title from “Retrieval practice prevents stress-induced inference impairment by restoring rapid memory reactivation” to, for example, “Retrieval practice prevents stress-induced inference impairment and preserves rapid bridge-item memory reactivation.” We will also revise the Abstract, Results, and Discussion to replace stronger mechanistic wording such as “prevents,” “restoring,” and “essential neural mechanism” with more cautious phrasing such as “buffers” or “attenuates,” “preserves” or “is associated with,” and “neural correlate” or “candidate process,” as appropriate. This revision will lead us to temper the overall interpretation of the EEG findings: rather than claiming that reactivation is the mechanism by which retrieval practice prevents stress-related inference deficits, we will conclude that rapid bridge-item reactivation is a neural correlate of successful inference that is sensitive to stress and enhanced by retrieval practice.

      We also appreciate the reviewer’s concern regarding the use of one-tailed follow-up tests and the absence of multiple-comparison correction. With respect to the one-tailed t-tests, these follow-up comparisons were conducted because the relevant hypotheses were directional a priori. Based on prior work and our theoretical framework, we specifically predicted that acute stress would impair inference-related performance and neural reactivation, and that retrieval practice would mitigate these effects. The follow-up tests were therefore not exploratory post-hoc comparisons, but planned tests used to decompose the significant omnibus effects in the predicted direction. For this reason, we considered one-tailed testing appropriate for these comparisons.

      Similarly, we did not apply an additional multiple-comparison correction to these planned follow-up tests because they were limited in number, theory-driven, and conducted to evaluate specific directional predictions rather than to search broadly across many possible contrasts. Importantly, our interpretation does not depend on any isolated post-hoc comparison, but on the consistency of the results across behavioral inference measures, neural decoding of bridge-item reactivation, and theta-band analyses. We have revised the manuscript to make this rationale clearer and to ensure that the follow-up results are interpreted in the context of the full pattern of evidence.

      (3) We agree that, in the previous version, parts of the manuscript were not structured clearly enough, which may have made it difficult for readers to follow the logic of the study and the sequence of analyses without moving back and forth across sections. In the revised manuscript, we will reorganize the presentation to improve the overall narrative flow and readability. Specifically, we plan to clarify the study logic and analysis sequence, strengthen transitions between sections, and revise the relevant text in line with Reviewer #3’s detailed suggestions.
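
      The trial-count-matched analysis planned in response (1) to Reviewer #3 could be set up along the following lines. This is only a minimal sketch under stated assumptions: the group labels, trial identifiers, and the function name `match_trial_counts` are hypothetical and are not taken from the manuscript, and simple random subsampling is one of several possible matching schemes.

```python
import random


def match_trial_counts(trials_by_group, seed=0):
    """Randomly subsample every group down to the smallest group's trial count.

    trials_by_group maps a (hypothetical) group label to a list of trial IDs.
    """
    rng = random.Random(seed)  # seeded so each subsample is reproducible
    n_min = min(len(trials) for trials in trials_by_group.values())
    return {
        group: rng.sample(trials, n_min)  # sampling without replacement
        for group, trials in trials_by_group.items()
    }


# Hypothetical example: two groups with unequal numbers of usable trials.
trials = {
    "stress": list(range(40)),
    "control": list(range(100, 155)),
}
matched = match_trial_counts(trials, seed=1)
```

      In practice the subsampling would be repeated over many seeds and the analysis of interest re-run on each matched subset, so that the conclusion does not hinge on one particular draw.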

    1. eLife Assessment

      In this fundamental work, Horne et al. present compelling evidence that YbjP is a novel binding partner of the TolC channel protein. YbjP is characterized using cryo-EM, and its role is probed using pull-down experiments, in vivo crosslinking, and functional assays, along with phylogenetic analysis, all of which are properly performed and presented and support the main conclusions. While the study does not identify a clear role for this protein, the results contribute to the understanding of this complex system and will be of interest to those working in the fields of membrane transport and antimicrobial resistance.

    2. Reviewer #1 (Public review):

      Summary:

      The authors report a novel binding partner of the TolC channel protein that forms complexes with the two principal classes of transporter-based tripartite assemblies (both ABC- and RND-transporter based) and appears to modulate their function, while also anchoring TolC into the outer membrane, compensating for the lack of direct lipidation seen in other members of the OMF family.

      The newly identified protein, YbjP, is comprehensively characterized from both phylogenetic and structural perspectives. Two independent cryo-EM structures (MacAB-TolC-YbjP and AcrABZ-TolC-YbjP) provide strong structural evidence for its role and are generated using peptidiscs, mimicking the membrane environment. These findings are further supported by pull-down experiments (including state-of-the-art in vivo photo crosslinking) and functional assays for a well-rounded characterisation of the protein, and a significant amount of modelling and phylogenetic analysis. This work sheds light on the function of the members of the DUF3828-containing protein family, which appear to anchor TolC to the outer membrane and influence the expression of the TnaB and YojI transporters.

      Strengths:

      The strengths of the manuscript are numerous, and it presents a well-rounded package of structural biology complemented by functional and computational studies.

      The full assemblies of both MacAB-TolC-YbjP and AcrABZ-TolC-YbjP are reconstituted and resolved to near-atomic resolution using cryo-EM for unambiguous assignment of binding interfaces, which are then validated using a number of techniques, including ITC, in vitro and in vivo binding assays and cross-linking.

      The evolutionary analysis is particularly notable, and provides genuine insight into the DUF3828-containing proteins, the function of which has remained enigmatic until now. Similarly, the investigation of YbjP's involvement in TolC trafficking and the analysis of the impact of YbjP deletion on the full E. coli proteome are commendable.

      Overall, this is a very solid piece of work, competently executed and presented, which significantly advances the field.

      Weaknesses:

      None obvious; however, the presentation, and especially the main-text illustrative material, seems to focus disproportionately on the MacAB-TolC-YbjP complex, while the AcrABZ-TolC-YbjP complex is relegated to supplementary data, which is somewhat confusing. There is no high-resolution side view of AcrABZ-TolC-YbjP shown side by side with MacAB-TolC-YbjP, which would help readers spot parallels and differences in the organisation of the two systems.

      Supplementary Figure 2 may also be better presented in the main text, as it shows specific displacements of residues upon binding of the YbjP relative to the apo-complexes, although this can be left at the authors' discretion.

    3. Reviewer #2 (Public review):

      This article focuses on the study of two E. coli tripartite efflux pumps both using TolC as partner in the outer membrane, namely MacAB-TolC and AcrABZ-TolC.

      By preparing MacAB-TolC in Peptidiscs rather than in detergent for cryo-EM structure determination, the authors visualized an extra protein localized around TolC. The resolution was sufficient to build part of the structure, and using the AlphaFold2 database and the DALI topology recognition program, they identified it as the lipoprotein YbjP. This protein is anchored in the outer membrane, and it was suggested that it could act as a support for TolC, which is the only OMF that does not have an N-terminal extension anchored in the outer membrane, a feature that has long puzzled the community working in this field.

      The authors used a large number of different approaches to evaluate the importance of YbjP (structure, genomic evolution, microbiology, in vivo photocrosslinking, proteomic profiling), but have not so far succeeded in finding a clear role for it, even if it could be important under particular environmental stresses. Nevertheless, their results are of great interest for understanding the complexity of such systems and deserve publication.

      The different analyses are properly performed and presented, and support the conclusions.

      My only concern is with the photocrosslinking presented in Figures 3 and S3. My impression is that the bands do not migrate at the proper size after crosslinking.

      A second point that could be discussed further is the comparison of the structure of the pump in the presence of the peptidoglycan with the images previously obtained by tomography. It is not totally clear to me if YbjP could have been positioned in these maps.

    1. eLife Assessment

      This useful study presents a new method to identify the activity of single motor units from intramuscular EMG recordings. Validation against state-of-the-art techniques is limited to a small sample of simulated motor units; consequently, the evidence supporting the method's accuracy remains incomplete. The manuscript would be significantly strengthened by using more unbiased simulations for validation, validating the method with experimental datasets, comparing it against more recent techniques, and investigating how muscle physiology impacts accuracy.

    2. Reviewer #1 (Public review):

      Summary:

      The authors introduce EMUsort, an open-source algorithm for the automatic decomposition of high-resolution intramuscular EMG recordings. The method builds upon the Kilosort4 (KS4) framework and incorporates modifications designed to better handle the spatial and temporal characteristics of intramuscular signals. The performance of EMUsort is evaluated on openly available datasets and compared against KS4 and MUedit, demonstrating improved motor unit accuracy.

      Strengths:

      (1) The manuscript is clearly written, technically detailed, and well structured.

      (2) The open-source software is thoroughly documented, both within the manuscript and in the accompanying repository README, facilitating adoption by the community.

      (3) The availability of both code and datasets is a major strength, enabling reproducibility and independent validation.

      (4) The authors provide quantitative comparisons with existing decomposition algorithms, which is essential for contextualizing the proposed method.

      (5) The methodological details are sufficiently described to allow replication and further development by other researchers.

      Weaknesses:

      While the manuscript is strong overall, I have several suggestions that could further strengthen its impact and clarity.

      (1) Benchmarking and community integration

      Recent work has proposed standardized datasets and benchmarking pipelines for high-density surface EMG decomposition ("MUniverse: A Simulation and Benchmarking Suite for Motor Unit Decomposition", Mamidanna*, Klotz*, Halatsis* et al., NeurIPS 2025). A similar effort for intramuscular EMG would be highly valuable to the field. The authors may consider discussing how their dataset and algorithm could be integrated into broader benchmarking initiatives (e.g., platforms such as MUniverse), enabling systematic comparisons across multiple datasets and decomposition methods.

      (2) Comparison with additional decomposition algorithms

      Since the manuscript compares EMUsort with MUedit, it would be appropriate to also include a comparison with Swarm-Contrastive Decomposition (SCD), which has been proposed for both surface and intramuscular EMG signals. Including this comparison, or explicitly discussing why it was not feasible, would strengthen the positioning of EMUsort relative to the current state of the art.

      (3) Manual editing and post-processing

      In practical EMG decomposition workflows, manual inspection and editing of motor units are often required after automatic decomposition. It would be useful for readers to know whether EMUsort provides (or is compatible with) a graphical interface or workflow for manual refinement, or how the authors envision this step being handled.

      (4) Ablation analysis of algorithmic modifications

      EMUsort is described as an extension of Kilosort4. An ablation analysis examining the impact of the main modifications introduced relative to KS4 would help clarify which changes contribute most to the observed performance improvements and under which conditions.

      (5) Failure modes and limitations

      A more explicit discussion of when EMUsort is likely to fail or degrade in performance would be valuable. For example, sensitivity to the number of channels, recording duration, signal quality, or motor unit density could be discussed to guide users.

      (6) Generalisability to surface EMG

      Given the shared methodological foundations between surface and intramuscular EMG decomposition, it would be helpful to know whether EMUsort has been tested on high-density surface EMG datasets or whether the authors expect limitations when applied outside the intramuscular domain.

      (7) Applicability to human intramuscular recordings

      The authors could clarify whether EMUsort has been tested on human intramuscular EMG, and discuss any expected differences in performance due to anatomical or physiological factors.

      (8) Parameter sensitivity

      Clustering-based methods can be sensitive to parameter choices. Reporting a parameter sensitivity analysis, or at least discussing the robustness of EMUsort to parameter variations, would increase confidence in the method's reliability and ease of use.

      (9) Differences between template matching and BSS methods

      Since the manuscript proposes a new template-matching algorithm but compares its performance with a blind source separation (BSS) algorithm (MUedit), BSS algorithms should be described in the introduction. The differences between the two methodologies should be highlighted, and the pros and cons of each described.

      Conclusion:

      The authors largely achieve their stated aims, and the results mostly support the main conclusions. EMUsort represents a meaningful contribution to the EMG decomposition literature, particularly for researchers working with high-resolution intramuscular recordings. With additional clarification regarding benchmarking, algorithmic ablations, and limitations, the manuscript would be further strengthened and likely to have a substantial impact on the field.

    3. Reviewer #2 (Public review):

      Summary:

      This work presents a new spike sorter, EMUsort, to target the challenging task of spike sorting Motor Unit Action Potentials (MUAPs). EMUsort is essentially a modified version of Kilosort, with some key extensions to target EMG data: correction for large delays due to propagation across channels, spike detection of highly overlapping and large units via multiple thresholds, an increased number of waveform templates for spike detection, and an extended representation of waveforms to capture complex MUAP spike shapes. The results on simulated data show solid evidence that the applied modifications make a difference for EMG recordings. All in all, I believe that EMUsort will greatly improve spike sorting performance for high-density EMG data.

      Strengths:

      The manuscript is well written, and the methods and modifications to the Kilosort pipeline are well-motivated, well-explained, and clear. The simulation results provide strong evidence that the presented modifications make spike sorting of high-density EMG data more accurate.

      Weaknesses:

      Overall, the method is validated on only 15 simulated motor units. The monkey dataset, in particular, seems too "easy" and not challenging enough to highlight weaknesses of any of the spike sorters. A second weakness is the distribution of the code, which is shipped with submodules for Kilosort and SpikeInterface; this makes it hard to maintain long-term and pins these key dependencies to old versions.

    4. Reviewer #3 (Public review):

      Summary

      This paper introduces EMUsort, an extension of Kilosort4 designed to sort motor unit action potentials from high-density intramuscular EMG recordings. Using rat and monkey forelimb recordings, the authors generate realistic simulated datasets with known ground truth and demonstrate that EMUsort substantially outperforms Kilosort4 and MUedit, particularly during periods of high motor unit overlap.

      Strengths

      This is a timely study in light of recent advances in intramuscular muscle recording technologies and the growing interest in automated methods for decoding neural and neuromuscular signals. The work leverages state-of-the-art electrode arrays and combines them with advanced signal processing tools to address a challenging and relevant problem in motor unit analysis.

      Weaknesses

      There are several aspects of the study that substantially limit the interpretation of the main results and conclusions. The following major points should be carefully considered by the authors.

      (1) Choice of experimental model and validation framework: The study aims to validate a new methodology for EMG decomposition, yet the rationale for the chosen experimental models is unclear. Specifically, it is not evident why the authors focused on intramuscular recordings from two animal models performing dynamic tasks. Given the extensive literature on the development and validation of EMG decomposition methods, the choice of an experimental design that substantially deviates from established approaches is insufficiently justified. In particular, it is unclear why the authors did not consider more standard validation paradigms based on (i) isometric contractions, (ii) human data, (iii) surface EMG recordings, or (iv) combinations of their recording technologies with previously validated motor unit identification methods. This methodological divergence makes it difficult to interpret the findings in the context of existing evidence.

      (2) Lack of manual EMG decomposition as reference: Related to the previous point, it is unclear why standard manual EMG decomposition methods were not used to generate reference datasets for validation. Manual decomposition has been shown to be reliable under specific conditions (low contraction levels, slow dynamics, etc.) and would have substantially strengthened the validation of the proposed algorithm.

      (3) Neglect of muscle deformation effects: While the manuscript discusses several factors that complicate EMG decomposition relative to brain recordings, it does not address the well-known effects of muscle deformation during contractions on motor unit action potential shapes. There is extensive literature demonstrating that dynamic muscle contractions lead to systematic changes in action potential morphology, representing a major challenge for EMG decomposition and a fundamental difference from brain recordings. Additionally, even small relative movements of intramuscular electrodes can produce waveform changes that are large relative to muscle fiber dimensions. These issues are particularly relevant given the highly dynamic tasks studied here (e.g., treadmill walking in rats), yet they are not discussed or incorporated into the analysis.

      (4) Exclusive reliance on simulated data for validation: The primary validation of EMUsort is based on simulated data, which represents a major limitation of the study. This reliance should be clearly and explicitly stated in the abstract, introduction, and discussion. Moreover, the simulation approach itself raises concerns. The simulated EMG signals are generated using templates derived from the same sorting framework being validated, which introduces a potential methodological bias. The linear combination of components used to synthesize waveforms constitutes an unjustified modeling assumption that may favor template-based approaches such as EMUsort. Additionally, the spike time generation procedure appears unnecessarily complex and insufficiently justified. Previous validation studies typically modeled motor units as firing at relatively stable levels along their recruitment curves, producing long spike trains with pseudo-random relative timing and diverse overlap conditions. This framework would likely provide a more robust and interpretable validation. If the authors believe their simulation approach is superior, a stronger justification is required. Finally, the limited number of simulated motor units is difficult to reconcile with the expected level of motor unit recruitment during the studied behaviors, and this choice is not adequately justified.

      (5) Incomplete reporting and visualization of experimental data: The manuscript would benefit from a clearer description of the number of rats and monkeys used, which should be reported explicitly in the abstract. In addition, visualizations of the raw multichannel EMG data across different task phases and activation levels would substantially improve transparency. Providing comprehensive visualizations of motor unit action potential shapes across all channels and identified units (for both rats and monkeys) would also help readers assess the spatiotemporal features that underpin unit identification and sorting reliability.

      (6) Physiological limitations of conduction delay correction: The proposed method for correcting conduction delays across channels is physiologically suboptimal. First, motor unit conduction velocities differ substantially across units, implying that delay correction should be applied at the unit level rather than uniformly across channels. Second, conduction delays depend on fiber orientation and distance relative to electrode geometry; if fibers are oriented at different angles with respect to the array, a single delay correction becomes invalid. Additionally, the schematic in Figure 2A appears to contradict the proposed correction approach: if electrode threads are arranged perpendicular to muscle fibers, conduction delays across channels within a single thread should be minimal.

      (7) Clarity issues in Figure 4: Figure 4 (panels A-D) is potentially misleading. It should be clearly stated whether the signals shown are artificial examples or derived from real recordings; ideally, real data should be used to illustrate the advantages of dynamic thresholds. In panels B-D, the depiction of overlapping action potentials is difficult to interpret due to the thickness of the traces, and it is unclear whether overlaps with neighboring action potentials are absent by design or expected to occur in real data. If overlaps are expected, one would also expect to observe contamination in the extracted waveforms, which is not evident in the figure.

      (8) Concerns regarding method comparisons: The comparison with existing methods raises methodological concerns. It appears that EMUsort was carefully optimized, whereas the competing algorithms were not equivalently fine-tuned. The literature clearly shows that EMG decomposition performance depends strongly on adapting algorithms to the signal type (intramuscular vs. surface, species, electrode geometry). Furthermore, it is surprising that MUedit is reported to perform particularly poorly during periods of motor unit overlap, as blind source separation methods were explicitly developed to handle convolutive mixtures and overlapping sources, especially in surface EMG (which is an extreme case of motor unit overlap). This discrepancy requires further explanation.

      (9) Insufficient characterization of motor unit firing properties: The study does not provide sufficient information about the firing characteristics of the identified motor units in experimental data. Relevant metrics that should be reported include average, minimum, and maximum firing rates; coefficients of variation of discharge rate; signal-to-noise ratios of motor unit action potentials; potential evidence of motor unit rotation over time; and stability of firing behavior across recording intervals.

      (10) Lack of theoretical framing: Given the scope and claims of the paper, it would be valuable to include a more theory-driven introduction explaining why different sorting approaches (e.g., template matching vs. blind source separation) may be more or less suitable depending on the nature of the recorded signals. A clearer conceptual rationale for why the proposed approach is expected to outperform existing methods would substantially strengthen the manuscript.

      (11) Limitations of validation metrics: Some of the metrics used to evaluate performance are not ideal. For example, reporting 0% accuracy for a unit is misleading and should instead be described as a failure to identify that unit. Similarly, comparisons of total spike counts are of limited interpretive value and may be misleading, as correct spike counts do not necessarily imply correct spike identities.

      (12) Clarification of computational performance claims: Finally, the discussion of computation times should clarify that some existing methods require substantial time for offline estimation of projection vectors but can operate in near real time once these vectors are learned and remain stable. This distinction is important for a fair comparison of practical usability.
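
      Point (11) above, that a correct spike count does not imply correct spike identities, can be illustrated with a small example. The sketch below is not the evaluation code used in the manuscript: the function names, the 1 ms matching tolerance, and the accuracy definition TP / (TP + FN + FP) are assumptions chosen for illustration, and the spike times are made up.

```python
def count_matches(true_times, sorted_times, tol=0.001):
    """Count one-to-one matches between two ascending spike-time lists (seconds).

    A sorted spike matches a ground-truth spike if they differ by at most tol.
    """
    i = j = matched = 0
    while i < len(true_times) and j < len(sorted_times):
        dt = sorted_times[j] - true_times[i]
        if abs(dt) <= tol:
            matched += 1
            i += 1
            j += 1
        elif dt < 0:
            j += 1  # sorted spike is too early to match this true spike: false positive
        else:
            i += 1  # this true spike has no partner within tolerance: miss
    return matched


def accuracy(true_times, sorted_times, tol=0.001):
    """Accuracy as TP / (TP + FN + FP); NaN if both spike trains are empty."""
    tp = count_matches(true_times, sorted_times, tol)
    fn = len(true_times) - tp
    fp = len(sorted_times) - tp
    total = tp + fn + fp
    return tp / total if total else float("nan")


# Made-up spike trains: same number of spikes, but every spike is 50 ms off.
truth = [0.100, 0.200, 0.300, 0.400]
shifted = [0.150, 0.250, 0.350, 0.450]

same_count = len(truth) == len(shifted)
acc_shifted = accuracy(truth, shifted)
acc_perfect = accuracy(truth, truth)
```

      Here the two spike trains have identical counts, yet no spike in `shifted` falls within the tolerance of any spike in `truth`, so a count-based comparison would look perfect while the identity-based accuracy is zero.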

    1. eLife Assessment

      This modeling study proposes important new insights into the circuit mechanisms underlying navigational control in insects. High-speed video recordings of ants are compared to detailed predictions from a new computational model, whose description is incomplete. If the model is sound, the similarities between the model and behavioral data suggest how complex behavioral motifs can emerge from a simple neural circuit. These results will be of interest to scientists studying the neural circuit basis of behavior, particularly in insects.

    2. Reviewer #1 (Public review):

      Summary:

      Freas and Wystrach present a computational model of steering in insects. In this model, the central complex provides an error signal indicating whether the animal should turn left or right; this error signal biases the function of an oscillator composed of two mutually inhibiting, self-exciting units. The output of these units generates a "steering signal" that is used to set both the direction and the speed of the ant. Additionally, a separate module induces pauses, and an inverse relation between forward speed and turning speed is externally imposed. Statistics of the trajectories generated by the model are compared to the measured behaviors of ants.

      Strengths:

      While the model is very simple compared to state-of-the-art models, that simplicity makes it a potentially useful guide to researchers studying insect navigation. Some predictions that emerge from the model appear to be experimentally testable, although a more complete description of the model and its parameters, as well as an analysis of how this model's predictions differ from previous models' predictions, would be required to design these experiments.

      Weaknesses:

      I found it difficult to identify evidence in the paper supporting central elements of the abstract. Hopefully, these difficulties can be resolved with a clearer presentation and the addition of supporting detail, especially in the methods.

      (1) The model is not clearly described

      In the Materials and Methods, there is no description of the model, just the statement "The computational model is presented in Figure 1" (probably a typo for Figure 2A-C) and a link to MATLAB source code. It is inappropriate to ask readers or reviewers to examine source code in lieu of providing a method, but I attempted to do so anyway. To my eye, the source code does not match the model presented in 2A-C. For instance, in 2C, "Steering signal" inhibits "Freeze", but I couldn't find this in the source. "Freeze" is shown to inhibit "steering signal," but as "steering signal" is a signed quantity, it's not clear what this means. Taken literally, since "ang_speed_raw = L-R," it would seem to indicate that "freeze" would bias towards right turns. In the code, "freeze" appears to be implemented through the boolean variable "speed_inhibition_time." The logic controlled by this variable doesn't appear to inhibit the "steering signal" but instead (depending on control parameters) either reduces the movement speed and amplifies the turning rate, or it turns the angular speed output into a temporal integral of the control signal.

      There are a number of parameters in the source code that aren't described at all in the paper, including the internal oscillator parameters.

      Together, these limitations make it difficult to understand what is being simulated, what parts of the model are tied to biology, and where the model improves on or departs from previous work.

      It is absolutely essential that authors fully describe the computational model, that they explain the meaning of all parameters of the model, and that they explain how the particular values of these parameters were chosen.

      (2) The biological inspiration is unclear

      A central claim of the paper is that the model is "biologically grounded." But some elements, for instance, using a signed quantity to represent left-right steering drive, are not biologically possible; at best, these are shorthand for biologically possible implementations, e.g., opposing groups of left-right driving neurons.

      The mechanism that produces fixations and saccades - the "freeze" module - is not tied to any particular anatomy of the insect brain. Initiation of a freeze occurs at a specific time coded into the model by the authors; it is not generated by an internal model signal. Release of a freeze is by drawing a random variable; there is no neural mechanism proposed to generate this signal.

      In some versions of the model, instead of directly controlling the signal, during fixations, the angular drive signal is integrated into a variable "cumul_drive." No neural substrate is proposed for this integrator. In the code, if cumul_drive passes a threshold, the angular heading of the ant changes (saccades), but only if this threshold is passed before the Poisson process ends the fixation. No neural substrate is proposed for any of this logic.
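
      The integrate-to-threshold logic described above can be sketched as follows. This is a minimal illustration of the logic as summarized in this review, not the authors' Matlab code: apart from the variable name cumul_drive, every name and parameter here (simulate_fixation, threshold, release_prob) is hypothetical, and the random release is approximated by a per-step Bernoulli draw.

```python
import random


def simulate_fixation(drive, threshold, release_prob, rng):
    """Integrate a signed steering drive during a single fixation.

    Returns (outcome, step, cumul_drive). The outcome is "saccade" if the
    integrated drive reaches the threshold first, "released" if the random
    release ends the fixation first, and "unfinished" if the drive sequence
    runs out before either happens. All names are hypothetical.
    """
    cumul_drive = 0.0
    for step, d in enumerate(drive):
        # Per-step Bernoulli draw, used here to approximate a Poisson release.
        if rng.random() < release_prob:
            return "released", step, cumul_drive
        cumul_drive += d
        # A saccade is triggered only if the threshold is reached in time.
        if abs(cumul_drive) >= threshold:
            return "saccade", step, cumul_drive
    return "unfinished", len(drive), cumul_drive


if __name__ == "__main__":
    rng = random.Random(0)
    print(simulate_fixation([0.25] * 10, threshold=1.0, release_prob=0.05, rng=rng))
```

      In this sketch the outcome of each fixation is a race between a deterministic integrator and a random timer, which is the point of the comment: neither component is tied to a proposed neural substrate.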

      The model steps forward in time by a fixed increment - the actual duration (in seconds) of this time step is not specified. From Figure 4F, G, it appears a simulation time step is meant to be about 10ms. This would imply an oscillator frequency of about 2 Hz (Fig 2B), that the heading oscillates at a similar frequency (2G), and that a forward crawling ant stops moving every 500 ms (2I). Are these plausible? Can they be compared to an experiment?
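
      The unit conversion behind these estimates can be made explicit. The sketch below assumes the 10 ms step inferred above from Figure 4F, G; the period of 50 steps is a hypothetical value chosen only because it corresponds to the roughly 2 Hz figure quoted, and it is not taken from the source code.

```python
def period_ms(period_steps, dt_ms):
    """Duration of one cycle in milliseconds, given its length in steps."""
    return period_steps * dt_ms


def frequency_hz(period_steps, dt_ms):
    """Frequency in Hz of a cycle that lasts period_steps simulation steps."""
    return 1000.0 / period_ms(period_steps, dt_ms)


if __name__ == "__main__":
    DT_MS = 10         # assumed step duration, inferred from Figure 4F, G
    PERIOD_STEPS = 50  # hypothetical oscillator period in steps
    print(period_ms(PERIOD_STEPS, DT_MS), "ms per cycle")
    print(frequency_hz(PERIOD_STEPS, DT_MS), "Hz")
```

      Under the assumed 10 ms step, any quantity that recurs every 50 steps has a 500 ms period, which is how the oscillator frequency and the stopping interval above are linked. Stating the step duration in the Methods would remove the need for this inference.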

      Model parameters, including the ones that control the frequency of the oscillator, are non-dimensionalized. It is not possible to evaluate whether these parameters are biologically plausible or match experimental results.

      (3) Claims that behaviors emerge from the model may be overstated

      The abstract claims that steering correction and fixations/saccades emerge naturally from the same model. But it appears to me that fixations/saccades are externally imposed by the specification of specific times for a "freeze." Faster angular rotation during saccades than during course correction is imposed and does not emerge naturally from neural simulations.

      (4) Citations to previous literature are difficult to follow, and modeling results are presented as though they are experimental data

      I would ask the authors to be much clearer in their description and citation of previous work. It should be clear whether the cited work was experimental or computational. To the extent possible, the actual measurement should be described succinctly. Instead of grouping references together to support a sentence with multiple claims, references should be cited for each claim. Studies of computational models should not be presented as proving a biological result.

      For example:

      a) Lines 141-146:<br /> "Previous studies have established many key components of insect navigation, including .... the intrinsic oscillatory dynamics in the lateral accessory lobes (LALs) that support continuous zigzagging locomotion (Clément et al., 2023; Kanzaki, 2005; Namiki and Kanzaki, 2016; Steinbeck et al., 2020)."

      The first reference is to one author's previous modeling work - it hypothesizes that oscillations in the LAL support zigzagging but includes no data that would "establish" the fact. Kanzaki et al. 2005 describes numerical modeling and simulation with a physical robot. Namiki and Kanzaki, 2016 is a review article that links the LAL to zigzagging behavior. It describes the LAL as a winner-take-all bistable network but does not describe or hypothesize that the LAL has intrinsic oscillatory dynamics. Steinbeck et al. 2020 is a more comprehensive review; it reinforces that the LAL is a winner-take-all bistable network that drives left-right steering, including during zig-zagging behavior. But in my reading, I could not find a statement that the LAL has intrinsic oscillatory dynamics (the closest is Steinbeck et al. saying the activity pattern switches regularly, as does the behavior; this doesn't imply that the LAL is intrinsically oscillatory.)

      b) Lines 701-703:<br /> "In plume-tracking moths, CX output has been shown to modulate LAL flip-flop neurons driving zigzagging (Adden et al., 2022)."

      This reads as though an experimental measurement was made, but in fact, this is modeling work.

      c) Lines 703-706:<br /> "In ants, strong goal signals in the CX - whether elicited by the path integrator or visual familiarity (Wehner et al., 2016; Wystrach et al., 2020b, 2015) do not only sharpen directional accuracy but also increase oscillation frequency (Clément et al., 2023)."

      Here again, modeling results are presented as though they were experimental data.

    3. Reviewer #2 (Public review):

      Summary:

      The paper by Freas and Wystrach is an interesting computational study, exploring the detailed mechanisms of how simple neural circuits could explain complex behavioral patterns observed in navigating ants. The authors compare detailed, high-speed video recordings of Australian desert ants (Melophorus bagoti) with predictions made by their new computational model and find convincing similarities between the model and the behavioral data, at a level of detail not previously studied. Particularly interesting are emerging properties of the model, yielding behavioral motifs it was not designed to reproduce, but which occur in natural ant behavior.

      Strengths:

      A strength of the study is that the model is based on previous models, without making major novel explicit assumptions. It combines existing models of the insect central complex with a model of the lateral accessory lobe and adds a stochastic inhibition of forward velocity to the interaction of central complex and lateral accessory lobes. The central complex provides corrective steering signals when the goal direction and the current heading of an insect are not aligned, while the lateral accessory lobes provide an intrinsic oscillator underlying the behavioral oscillations shown by walking ants at all times. These background oscillations are modulated by the steering signals from the central complex. Depending on which phase of the intrinsic oscillations coincides with the corrective signals, and how fast the ant is moving forward during this time, a complex set of behaviors emerges. Most prominently, scanning behaviors, which are regularly carried out by the ants, are recapitulated in great detail by the model. Additionally, other behaviors, such as full loops, emerge naturally from the model. While computational models are not to be seen as definite evidence for any biological reality, they can provide strong support for particular neural implementations. The current study is an excellent example in that it provides evidence for a serial arrangement of central complex circuits upstream of the lateral accessory lobe circuits, modulated by speed-regulating input. While the latter is hypothetical, it yields a clear hypothesis that can be validated by connectomics studies and functional work in the future.

      The study shows that even complex behavioral motifs do not require dedicated neural modules, but can rather emerge from the interplay of already known circuits - highlighting the efficiency of insect brains and possibly providing the path towards embodied hardware solutions of such circuits in autonomous agents.

      Weaknesses:

      There are several weaknesses in the paper as it is.

      Firstly, the model is not described in the methods, but only found when following the link to the authors' GitHub repository. This is clearly not sufficient and prevents readers from evaluating the model's assumptions directly. Most importantly, how naturally do the emergent properties actually emerge from the model? What parameters need to be tuned to generate a match between data and model?

      Second, it is often not entirely clear what is biological data and what is a computational model. This relates to figures, text, and references. As a reader, this makes it difficult to clearly judge what is new in the current paper, how it adds to previous models, and what the predictions and assumptions are for biology.

      Third, while neural data from bees and flies are taken to motivate and design the computational model, the discussion and interpretation revolve almost exclusively around ants. For the most part, this is justified, as the behavioral data used to benchmark the model are taken from ants. Nevertheless, more broadly discussing the newly defined circuit in the context of flying insects would give a better idea of the broad relevance of the neural circuits predicted by the model.

    1. eLife Assessment

      This study provides important insights into the crosstalk between ATG2A and components of the early secretory pathway. While the mechanisms governing autophagic membrane expansion remain to be fully understood, the authors employ an elegant proximity labelling approach and identify two ER-Golgi intermediate compartment (ERGIC)-localized proteins. Through a series of complementary experiments combining microscopy and biochemistry, the authors identify ARFGAP1 and Rab1A as components of early autophagic membranes, which accumulate at the periphery of pre-autophagosomal structures induced by loss of ATG2. The overall study is well executed and the evidence supporting the claims is convincing.

    2. Reviewer #1 (Public review):

      Summary:

      D. Fuller et al. set out to study the molecular partners that cooperate with ATG2A, a lipid transfer protein essential for phagophore elongation, during the process of autophagy. Through a series of experiments combining microscopy and biochemistry, the authors identify ARFGAP1 and Rab1A as components of early autophagic membranes, which accumulate at the periphery of aberrant pre-autophagosomal structures induced by loss of ATG2. While ARFGAP1 has no apparent function in autophagy, the authors show that RAB1A is implicated in autophagy, although the precise mechanisms are not explored in the manuscript.

      Strengths:

      The work presented by Fuller et al. provides new insights into the composition of early autophagic membranes. The authors provide a series of MS experiments identifying proteins in close proximity to ATG2A, which is a valuable dataset for the field. Furthermore, they show for the first time the interaction between ATG2A and RAB1A both in fed and starved conditions, which extends the characterisation of the pre-autophagosomal structures observed in ATG2 DKO cells.

      Weaknesses / Specific comments:

      (1) The authors claim that Rab1A/B knockdown phenocopies the LC3-II accumulation observed in ATG2 DKO cells. While LC3-II accumulation is consistent with this interpretation, depletion of many autophagy-related proteins can give rise to a similar phenotype, even when they function at distinct stages of the autophagic cascade. Therefore, LC3-II accumulation alone is insufficient to support phenocopying in my view. Immunofluorescence analyses demonstrating comparable cellular phenotypes (such as membrane accumulation of pre-autophagosomal structures) following Rab1 knockdown should be provided. Moreover, p62 does not accumulate upon Rab1 depletion, suggesting that loss of Rab1 does not fully phenocopy ATG2 deficiency. Consequently, it remains unclear whether Rab1A depletion truly phenocopies ATG2A depletion with respect to autophagy progression or the accumulation of pre-autophagosomal structures.

      (2) Interpretation of the significance of the data

      (2.1) The significance statement asserts that "this study elucidates the role of early secretory membranes in autophagosome biogenesis." While the data convincingly demonstrate an association between the RAB1A GTPase and ATG2A, the study does not provide mechanistic insight into how this interaction functionally contributes to autophagy. As presented, the findings support a correlative relationship rather than a defined role in autophagosome biogenesis.

      (2.2) The title states that ATG2A "engages" Rab1A- and ARFGAP1-positive membranes during autophagosome formation. However, both Rab1A and ARFGAP1 are shown to localize to pre-autophagosomal structures independently of ATG2A. In the absence of evidence demonstrating a functional or causal dependency, the term "engages" appears overstated. A more descriptive term, such as "associates," would more accurately reflect the data.

      (2.3) In the Discussion, the authors state that previous studies have reported increased LC3-II levels following knockdown of Rab1 proteins (refs. 38 and 49). However, it is unclear where this observation is documented in the cited references.

      (3) Some concerns remain in specific figures, as outlined below:<br /> • Quantification is missing in Fig S2D.<br /> • The authors claim: "siRNA against ARFGAP1 had very little effect" but the quantification and blots actually show no effect.<br /> • Conclusions drawn from KD experiments in Fig. S2 should be interpreted with caution, as knockdown efficiency is very low, particularly for ARFGAP1/3 in the triple knockdown.<br /> • In New Fig. 4, the representative blot is not representative of the results shown in the quantification, as previously noted.

    3. Reviewer #2 (Public review):

      The mechanisms governing autophagic membrane expansion remain incompletely understood. ATG2 is known to function as a lipid transfer protein critical for this process; however, how ATG2 is coordinated with the broader autophagic machinery and endomembrane systems has remained elusive. In this study, the authors employ an elegant proximity labeling approach and identify two ER-Golgi intermediate compartment (ERGIC)-localized proteins, Rab1 and ARFGAP1, as novel regulators of ATG2 during autophagic membrane expansion.

      Their findings support a model in which autophagosome formation occurs within a specialized subdomain of the ER that is enriched in both ER exit sites (ERES) and ERGIC, providing valuable mechanistic insight. The overall study is well executed and offers an important contribution to our understanding of autophagy. I support its publication in eLife and offer the following minor comments for clarification and improvement.

      Specific Comments

      (1) Integration with Prior Literature<br /> The data convincingly implicate the ERES-ERGIC interface in autophagosome biogenesis. It would strengthen the manuscript to discuss previous studies reporting ERES and ERGIC remodeling and formation of ERES-ERGIC contact sites (PMID: 34561617; PMID: 28754694) in the context of the current findings.

      (2) Figure Labeling<br /> The font size in Figure 1A and Supplementary Figure S1G is too small for comfortable reading. Please consider enlarging the labels to improve clarity.

      (3) Experimental Conditions<br /> In Figures 2A-C and Figure 4, it is unclear how the cells were treated. Were they starved in EBSS? Please include this information in the corresponding figure legends.

      (4) LC3 Lipidation vs. Cleavage<br /> In Figure 2A, ARFGAP1 knockdown appears to reduce LC3 lipidation without affecting Halo-LC3 cleavage. Clarifying this observation would help readers better understand the functional specificity of ARFGAP1 in the pathway.

      (5) Use of HT-mGFP in Figure 2C<br /> Please clarify why the assay in Figure 2C was performed in the presence of HT-mGFP. Explaining the rationale would aid interpretation of the results.

      (6) FIB-SEM Imaging<br /> For the FIB-SEM images in Figures 3 and S3, directly labeling the cellular structures in the images would greatly facilitate interpretation for the reader.

      (7) Supplementary Figures<br /> Many of the supplemental figures are high quality and contain key data. If space permits, I suggest moving these into the main figures. In particular, the FLASH-PAINT experiment could be presented as part of Figure 1.

      (8) Text Revision for Clarity<br /> In line 242, the phrase "but protein-protein interactions appear to be limited to RAB1" would benefit from clarification. A more precise formulation could be: "but stable protein-protein interactions appear to be limited to RAB1."

      (9) COPII Inhibition Strategy<br /> The authors used the dominant-active SAR1(H79G) mutant to inhibit COPII function. While this is effective in in vitro budding assays, the GDP-locked mutant SAR1(T39N) has been shown to be more effective in blocking COPII-mediated trafficking in cells. Including SAR1(T39N) in the analysis would provide stronger support for the conclusions.

    4. Reviewer #3 (Public review):

      The manuscript by Fuller et al. describes crosstalk between ATG2A and components of the early secretory pathway, namely RAB1A and ARFGAP1. They show that ATG2A is recruited to membranes positive for RAB1A, which they also show to interact with ATG2A. In agreement with earlier findings by other groups, silencing RAB1A negatively affects autophagy. While ARFGAP1 was also found on ATG2A-positive membranes, silencing ARFGAP1 had no impact on autophagy. Notably, these ARFGAP1-positive membranes are not Golgi membranes.

      The findings are interesting and the data are in general of good quality. I think the story is good enough to be published in eLife and I have the following questions, which the authors may attend to:

      (1) Are the membranes to which ATG2A is recruited a form of ERGIC?

      (2) Figure 3A/B: Is it possible to show a better example? The difference is barely detectable by eye. Since immunoblotting is not really a quantitative method, I think that such a weak effect is prone to error. Is there another tool/assay to validate this result?

      (3) Is the curvature-sensitive region of ARFGAP1 required for its co-localization with ATG2A?

      (4) What does Rab1A do? What is its effector? Or does the GTPase itself remodel the membrane?

      (5) What about Arf1? It appears that this role of ARFGAP1 is unrelated to Arf1 and COPI. Thus, one would predict that Arf1 does not localize to these structures and does not affect ATG2A function.

      (6) Does ARFGAP1 promote fission of the membrane from its donor compartment?

      (7) What are ARFGAP1 and Rab1A recruited to? What is the lipid composition, or protein that recruits these two players to regulate autophagy?

      Comments on the latest version:

      The revisions carried out by the authors are fine. The new data on ArfGAP1 and about the indirectness of the ATG2A and Rab1A interaction improve both clarity and strength of the manuscript. I have no further comments.

    5. Author response:

      The following is the authors’ response to the original reviews.

      We thank the reviewers and editors for their thoughtful comments, which substantially improved the quality and clarity of our manuscript. We have attempted to address each major concern with either new experiments or significant textual revisions.

      Reviewer 1 noted that “this research is conducted exclusively in HEK293 cells… including at least one additional cell line would significantly strengthen the main findings.” To directly address this concern, we repeated our RAB1A/B double-knockdown experiments in H4 neuroglioma cells, which endogenously express a tandem fluorescent-tagged LC3B reporter. Using flow cytometry to quantify autophagic flux, we confirmed that RAB1 depletion in H4 cells recapitulates the flux defects observed in HEK293 cells, thereby validating the generality of our findings across distinct lineages.

      To validate the robustness of the ATG2 DKO phenotype and the localization of ARFGAP1-positive membranes, we acquired an ATG2 double knockout HeLa cell line. We confirmed the presence of the characteristic large ATG2-deficient PAS compartment in HeLa cells, and the recruitment of ARFGAP1 membranes, but note that ARFGAP1 displays a solid distribution through the compartment in these cells, in contrast to the more peripheral enrichment observed in HEK293 cells. These data are now included and discussed in the revised manuscript.

      Multiple reviewers asked for greater clarity around the interaction between ATG2A and RAB1A. Although our original data showed that these proteins co-immunoprecipitate in cells, we had not established whether their association was direct. In response, we attempted in vitro co-immunoprecipitations from purified components. As we could not detect interactions in this simplified system, we now speculate that the ATG2A–RAB1A interaction is indirect. This clarification is now incorporated into the results section.

      Multiple reviewers also raised questions regarding the nature of the membranes recruiting ARFGAP1 and the potential relationship to Arf1 and Golgi trafficking. In particular, Reviewer 3 asked: “(5) What about Arf1? … one would predict that Arf1 does not localize to these structures and does not affect ATG2A function.” To examine whether ARFGAP1 recruitment depends on Golgi integrity or Arf1-regulated trafficking, we perturbed the Golgi using three mechanistically distinct methods: Brefeldin A, mitotic entry, and SidM expression, each of which dissolves Golgi architecture. In each condition, ARFGAP1 localization to the enlarged PAS compartment in ATG2 DKO cells was unchanged. These results indicate that ARFGAP1 recruitment is independent of Golgi structure and provide indirect support for the notion that Arf1 does not participate in this process. Reviewer 3 also asked: “Is the curvature-sensitive region of ARFGAP1 required for its co-localization with ATG2A?” To address this, we generated ARFGAP1 mutants lacking either GAP catalytic activity or the ALPS curvature-sensing domain. When expressed in ATG2 DKO cells, all mutants retained full recruitment to the PAS compartment. Thus, neither GAP activity nor ALPS-mediated curvature sensing is required for ARFGAP1 localization in this context.

      In response to Reviewer 3's question, “(2) Figure 3A/B: … is there another tool/assay to validate this result?”, we quantified autophagic flux following SAR1B(H79G) overexpression using the flow-cytometry tandem-fluorescent LC3 assay. These experiments confirmed that SAR1B(H79G) causes only a modest reduction in autophagic flux, consistent with partial inhibition of COPII, thereby supporting our original interpretation.

      We also took steps to improve the integration of our findings with prior literature. Reviewer 2 requested that we strengthen the manuscript by incorporating studies on ERES–ERGIC remodeling (“It would strengthen the manuscript to discuss previous studies…”). We now cite and discuss the studies corresponding to PMIDs 34561617 and 28754694, aligning our observations with mechanistic models of early secretory pathway remodeling. More broadly, Reviewer 1 commented that our discussion “overlooks some important aspects,” and Reviewer 3 asked, “Are the membranes to which ATG2A is recruited a form of ERGIC?” In response, we substantially rewrote the discussion, expanding our integration of existing literature and explicitly addressing models in which ATG2A acts at an ERGIC-derived membrane.

    1. eLife Assessment

      This study presents valuable findings on the ability of a state-of-the-art method, Temporally Delayed Linear Modelling (TDLM), to detect the replay of sequences in human memory. The investigation provides compelling evidence that TDLM has significant limitations in its sensitivity to detect replay in extended (minutes-long) rest periods. The work will be of strong interest to researchers investigating memory reactivation in humans, especially using iEEG, MEG, and EEG.

    2. Reviewer #1 (Public review):

      Summary:

      Participants learned a graph-based representation, but, contrary to the hypotheses, failed to show neural replay shortly after. This prompted a critical inquiry into temporally delayed linear modeling (TDLM)--the algorithm used to find replay. First, it was found that TDLM detects replay only at implausible numbers of replay events per second. Second, it detects replay-to-cognition correlations only at implausible densities. Third, there are concerning baseline shifts in sequenceness across participants. Fourth, spurious sequences arise in control conditions without a ground truth signal. Fifth, the revised manuscript adapts a previously published synthetic simulation to show that previous validations/support of TDLM may have overestimated TDLM sensitivity because synthetic assumptions can produce unrealistically high pattern separability and reduced baseline confounds.

      Strengths:

      - This work is meticulous and meets a high standard of transparency and open science, with preregistration, code and data sharing, external resources such as a GUI with the task and material for the public.

      - The writing is clear, balanced, and matter-of-fact.

      - By injecting visually evoked empirical data into the simulation, many surface-level problems are avoided, such as biological plausibility and questions of signal-to-noise ratio.

      - The investigation of sequenceness-to-cognition correlations is an especially useful add-on because much of the previous work uses this to make key claims about replay as a mechanism.

      - In the revised version, the authors foreshadow ways to improve sequenceness detection by introducing a sign-flipping analysis.

      Weaknesses:

      Many of the weaknesses are not so much flaws in the analyses, but shortcomings when it comes to interpretation and a lack of making these findings as useful as they could be. Furthermore, as I will explain below, some weaknesses have been partially improved in the last round of revisions.

      - I found the bigger picture analysis to be lacking, though improved in the latest version. Let us take stock: in other work during active cognition, including at least one study from the Authors, TDLM shows significant sequenceness. But the evidence provided here suggests that even very strong localizer patterns injected into the data cannot be detected as replay except at implausible speeds. How can both of these things be true? Assuming these analyses are cogent, do these findings not imply something more destructive about all studies that found positive results with TDLM? In the revisions, the manuscript concentrates a bit more on criteria that influence detection of sequences, though it is still not entirely clear what consequences there are for previous work.

      - All things considered, TDLM seems like a fairly vanilla and low-assumption algorithm for finding event sequences. Although the authors have improved their discussion of "boundary conditions" or factors for why TDLM might fail, it remains not fully clear to what extent the core problem is TDLM on an algorithmic/mathematical level (intrinsic factor) vs. data quality, power, and window size (extrinsic factors).

      - The new sign-flip analysis underscores the authors' goal of being solution-oriented, though it is important to emphasize that a comprehensive way forward is not yet provided. This is fine, but the manuscript could be improved further through a concrete alternative or a revised version of the original approach.

    3. Reviewer #2 (Public review):

      Summary:

      Kern et al. investigated whether temporally delayed linear modeling (TDLM) can uncover sequential memory replay from a graph-learning task in human MEG during an 8-minute post-learning rest period. After failing to detect replay events, they conduct a simulation study in which they insert synthetic replay events, derived from each participant's localizer data, into a control rest period prior to learning. The simulations suggest that TDLM only reveals sequences when replay occurs at very high densities (> 80 per minute) and that individual differences in baseline sequenceness may lead to spurious and/or lacklustre correlations between replay strength and behavior.
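
      To give a sense of scale for the density threshold mentioned above, the sketch below schedules synthetic event onsets at a chosen density. It is an illustration only and not the authors' insertion procedure: the jittered-grid scheme, the function name, the 200 ms event duration, and the use of an 8-minute duration for the control rest period are all assumptions made here.

```python
import random


def schedule_replay_onsets(density_per_min, duration_min, event_ms, rng):
    """Return non-overlapping event onsets (ms) at the requested density.

    Hypothetical jittered-grid scheme: the recording is split into equal
    slots, and one event is placed at a random offset inside each slot.
    """
    n_events = density_per_min * duration_min
    total_ms = duration_min * 60_000
    slot_ms = total_ms // n_events
    if slot_ms < event_ms:
        raise ValueError("events of this duration do not fit at this density")
    # Each event starts early enough to finish before its slot ends.
    return [i * slot_ms + rng.randint(0, slot_ms - event_ms)
            for i in range(n_events)]


if __name__ == "__main__":
    onsets = schedule_replay_onsets(80, 8, 200, random.Random(0))
    print(len(onsets), "events, one per", 480_000 // len(onsets), "ms on average")
```

      Under these assumptions, 80 events per minute over 8 minutes amounts to 640 events, or one event every 750 ms on average.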

      Strengths:

      The approach is extremely well documented and rigorous. The authors have done an excellent job re-creating the TDLM methodology that is most commonly used, reporting the different approaches and parameters that they used, and reporting their preregistrations. The hybrid simulation study is creative and provides a new way to assess the efficacy of replay decoding methods, and its comparison to earlier published TDLM simulations is particularly useful. The authors remain measured in the scope/applicability of their conclusions, constructive in their discussion, and end with a useful set of recommendations for how to best apply TDLM in future studies. I also want to commend this work for not only presenting a null result, but thoroughly exploring the conditions under which such a null result is expected. I think this paper is interesting and will be generally quite useful for the field.

      In the revised version, the authors have adequately addressed each of the weaknesses I raised previously. In brief, they:

      (i) Added new power analyses of sequenceness for bootstrapped sample sizes, along with a new permutation test (Supplemental Fig 11),

      (ii) Qualified their conclusions with added limitations and clarified several points that I found previously unclear,

      (iii) Added several new analyses to the Appendices

      (iv) Demonstrated that previous simulations validating TDLM overestimated TDLM sensitivity relative to the hybrid simulation.

      (v) Added a new and extensive appendix on the relationship between TDLM and replay characteristics.

      Weaknesses:

      The remaining weaknesses of the work relate primarily to explaining the cause of measured non-random fluctuations in the simulated correlations between replay detection and performance at different time lags, as well as a lack of general recommendations of parameter choices for applying TDLM in future work. But these are minor weaknesses that can be left to future work.

    4. Reviewer #3 (Public review):

      Summary:

      Kern et al. critically assess the sensitivity of temporally delayed linear modelling (TDLM), a relatively new method used to detect memory replay in humans via MEG. While TDLM has recently gained traction and been used to report many exciting links between replay and behavior in humans, Kern et al. were unable to detect replay during a post-learning rest period. To determine whether this null result reflected an actual absence of replay or limited sensitivity of the method, the authors ran a simulation: synthetic replay events were inserted into a control dataset, and TDLM was used to decode them, varying both replay density and its correlation with behavior. The results revealed that TDLM could only reliably detect replay at unrealistically high (non-physiological) replay densities, and the authors were unable to induce strong behavior correlations. These findings highlight important limitations of TDLM, particularly for detecting replay over extended, minutes-long time periods.

      Strengths:

      Overall, I think this is an extremely important paper, given the growing use of TDLM to report exciting relationships between replay and behavior in humans. I found the text clear, the results compelling, and the critique of TDLM quite fair: it is not that this method can never be applied, but just that it has limits in its sensitivity to detect replay during minutes-long periods. Further, I greatly appreciated the authors' efforts to describe ways to improve TDLM: developing better decoders and applying them to smaller time windows.

      The power of this paper comes from the simulation whereby the authors inserted replay events and attempted to detect them using TDLM. Regarding their first study, there are many alternative explanations or possible analysis strategies that the authors do not discuss; however, none of these are relevant if replay, even when synthetically inserted, cannot be detected.

      Further, the authors provide a simulation and series of analyses aimed at replicating previous TDLM-based replay studies. They demonstrate methodological flaws, and show that previous simulations greatly overestimated the sensitivity of TDLM. This work emphasizes the need to cast a critical eye over both past and future studies applying TDLM to detect replay.

      Finally, the authors are relatively clear about which parameters they chose, why they chose them, and how well they match previous literature (they seem well matched); and provide suggestions for how others can determine the best parameters for TDLM within their own experimental contexts.

      Comments on revisions:

      The authors thoroughly addressed my previous comments; the added analyses and discussion significantly strengthen the paper's clarity, utility, and impact.

    5. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      (1) I found the bigger picture analysis to be lacking. Let us take stock: in other work, during active cognition, including at least one study from the Authors, TDLM shows significant sequenceness. But the evidence provided here suggests that even very strong localizer patterns injected into the data cannot be detected as replay except at implausible speeds. How can both of these things be true? Assuming these analyses are cogent, do these findings not imply something more destructive about all studies that found positive results with TDLM?

      Our focus here is on advancing methodology. Given the diversity of tasks and cognitive states in the TDLM literature, replay could exceed detection thresholds under specific conditions—especially when true event durations align with short analysis windows. While a comprehensive re-analysis of prior datasets is beyond our scope, we agree a concise synthesis can strengthen the paper.

      The previous TDLM literature uses a diverse set of tasks and addresses a broad spectrum of cognitive constructs/processes. As we acknowledge, it is perfectly possible that replay bursts in short time windows are well detectable by TDLM. However, we acknowledge that some commentary on this is warranted and have added the following paragraph to the discussion that addresses “improving TDLMs sensitivity”:

      “Finally, what do our simulations imply for the broader MEG replay literature? Our implementation successfully detects replay when boundary conditions are met, as shown in the simulation. But sensitivity depends critically on high fidelity between the analysis window and the density of replay events. A systematic evaluation of these conditions as they apply to prior studies remains beyond the scope of the current paper. Instead, our focus is on delineating boundary conditions that we hope will motivate conduct of power analyses in future work as well as inclusion of simulations that approximate realistic experimental conditions.”

      (2) All things considered, TDLM seems like a fairly 'vanilla' and low-assumption algorithm for finding event sequences. It is hard to see intuitively what the breaking factor might be; why do the authors think ground truth patterns cannot be detected by this GLM-based framework at reasonable densities?

      We agree with the overall sentiment of the referee. Our intuition is that one of the principal shortcomings of the method relates to spurious sequenceness induced by unknown factors at baseline, and poor transfer of the decoder to other modalities. Although we have a rough understanding of how these confounders occur, we are currently not in a position to identify their nature. Note that we believe that these confounders are not exclusive to TDLM but potentially threaten all kinds of sequenceness analysis of longer time series that rely on decoders. Indeed, we suspect that classifier training is another bottleneck, as we don’t know the exact nature of the representations that are replayed, including the degree of overlap with a commonly used visual localizer. That said, this is not of relevance for the simulation insofar as we insert patterns that exceed the pattern strength in the localizer.

      Finally, a potential major drawback is the permutation test for significance testing. As the original authors of TDLM have noted, the current test which permutes states is overly conservative. It measures fixed effects and as it only considers the group level mean it is accordingly easily biased by individual outliers. This we have tried to account for by z-scoring sequenceness scores. We have also conferred on this with some of the authors of TDLM and discussed a yet unpublished method that aims to address this exact issue. The proposed new method uses a sign-flip permutation test at a group level and therefore implements a random-effects model of the data. This significance test has markedly increased power while still controlling for FWER. However, while we show in our power analysis that the new method is indeed more sensitive, it does not materially change the interpretation of the data. We have included this novel method in the paper and added it into the main analysis and most of the simulations.
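      The group-level sign-flip test mentioned above can be sketched as follows. This is a minimal illustration, not the unpublished method itself: the function name `sign_flip_test`, the use of the group mean as test statistic, and the max-across-lags correction for FWER are assumptions made for the example.

```python
import numpy as np

def sign_flip_test(seq, n_perm=5000, seed=0):
    """Group-level sign-flip permutation test (illustrative sketch).

    seq: array of shape (n_participants, n_lags) holding one sequenceness
         value per participant and time lag (hypothetical input layout).
    Returns the observed group mean per lag and a family-wise corrected
    p-value per lag, using the maximum statistic across lags.
    """
    rng = np.random.default_rng(seed)
    seq = np.asarray(seq, dtype=float)
    n_sub, _ = seq.shape
    observed = seq.mean(axis=0)

    # Under the null (values symmetric around zero) the sign of each
    # participant's sequenceness curve is exchangeable.
    signs = rng.choice([-1.0, 1.0], size=(n_perm, n_sub))
    perm_means = signs @ seq / n_sub              # shape (n_perm, n_lags)

    # Taking the maximum over lags controls the family-wise error rate
    # across all tested lags.
    max_null = np.abs(perm_means).max(axis=1)     # shape (n_perm,)
    exceed = (max_null[:, None] >= np.abs(observed)[None, :]).sum(axis=0)
    p_corrected = (1 + exceed) / (n_perm + 1)
    return observed, p_corrected
```

      Because signs are flipped per participant, between-participant variance enters the null distribution directly, which is what makes this a random-effects test in contrast to thresholds derived from permuting states.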

      (3) Can the authors sketch any directions for alternative methods? It seems we need an algorithm that outperforms TDLM, but not many clues or speculations are given as to what that might look like. Relatedly, no technical or "internal" critique is provided. What is it about TDLM that causes it to be so weak?

      We believe there are several shortcomings and bottlenecks within TDLM that need to be evaluated and improved. While we highlight these issues in the discussion section titled “Improving TDLMs sensitivity,” we agree that we should provide a clearer outline of its current shortcomings. We have now added to the discussion to expand on what we think needs improvement (‘fixed time lag’) and also added a summary statement at the end of the relevant paragraph to recap the main issues that an improved successor method would need to address. The new paragraphs read:

      “Lastly, there are certain assumptions that TDLM makes that might not hold (see Methods Study II): Current implementations look for a fixed time lag that is the same across all participants and between all reactivation events. If time lags differ across participants, TDLM will fail to find them. Similarly, TDLM assumes a fixed sequence order and is not robust against slight within-sequence permutations or in-sequence missing reactivation events. However, from other data sources, such as hippocampal place cell recordings, it is known that such permutations can occur where some states are skipped or fail to decode during replay. Similarly, it is assumed that each reactivation event lasts between 10-30 milliseconds, but the true temporal evolution of reactivation measured by TDLM is currently unknown. Future method development might focus on improving invariance to these assumptions.

      […]

      In summary, there are several areas where TDLM might be improved, including a restriction in its search space, improvement in classifiers, a validation of localizer representation transfer to other domains (e.g. memory representations), and the extension of TDLM to render it more robust against violations of its core assumptions.”
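      To make the fixed-lag and fixed-order assumptions concrete, the two-level GLM at the core of TDLM can be sketched roughly as below. This is a simplified illustration under stated assumptions, not the authors' implementation: the function name `tdlm_sequenceness` and its arguments are hypothetical, lags are given in samples, and refinements such as controlling for oscillatory confounds are omitted.

```python
import numpy as np

def tdlm_sequenceness(X, T_fwd, lags):
    """Simplified two-level TDLM sketch.

    X:     (n_samples, n_states) decoded state time courses.
    T_fwd: (n_states, n_states) hypothesised transition matrix,
           T_fwd[i, j] = 1 if state i is expected to be followed by j.
    lags:  iterable of positive integer lags, in samples.
    Returns forward and backward sequenceness, one value per lag.
    """
    X = np.asarray(X, dtype=float)
    T_fwd = np.asarray(T_fwd, dtype=float)
    n_states = X.shape[1]

    # Second-level design: forward, backward, self-transition, constant.
    design = np.column_stack([
        T_fwd.ravel(),
        T_fwd.T.ravel(),
        np.eye(n_states).ravel(),
        np.ones(n_states ** 2),
    ])

    fwd, bwd = [], []
    for lag in lags:
        # First level: regress every state at t + lag on all states at t.
        past = np.column_stack([X[:-lag], np.ones(len(X) - lag)])
        future = X[lag:]
        coef = np.linalg.lstsq(past, future, rcond=None)[0]
        beta = coef[:n_states]        # beta[i, j]: state i at t -> state j at t + lag

        # Second level: how much of beta is explained by the hypothesis.
        z = np.linalg.lstsq(design, beta.ravel(), rcond=None)[0]
        fwd.append(z[0])
        bwd.append(z[1])
    return np.array(fwd), np.array(bwd)
```

      In this sketch a single lag is applied to the whole time series and `T_fwd` is fixed in advance, so events with variable lags or skipped states contribute little to the estimate, which mirrors the limitations listed above.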

      Reviewer #2 (Public review):

      Weaknesses:

      The sample size is small (n=21, after exclusions), even for TDLM studies (which typically have somewhere between 25-40 participants). The authors address this somewhat through a power analysis of the relationship between replay and behavioural performance in their simulations, but this is very dependent on the assumptions of the simulation. Further, according to their own power analysis, the replay-behaviour correlations are seriously underpowered (~10% power according to Figure 7C), and so if this is to be taken at face value, their own null findings on this point (Figure 3C) could therefore just reflect under sampling as opposed to methodological failure. I think this point needs to be made more clearly earlier in the manuscript.

      We agree with the referee that our sample is smaller than previous studies due to participant exclusion criteria. However, the take-away message from our behavioural simulation and bootstrapping is that even with larger sample sizes, it is difficult to overcome baseline fluctuations of sequenceness, even if very strong replay patterns were detectable and sample sizes were of similar size to that of previous studies. Therefore, we are not convinced that our null findings are fully explained by the smaller sample size compared to that of previous studies. Additionally, we show that even within the range of other studies, similar power would have been expected (Supplement Figure 11). However, it is true that in general null findings can be explained by under-sampling, under the assumption that an effect is present. To amplify this point, we have added the following to Figure 3C:

      “[…]. NB, however, as our simulation shows, correlations of sequenceness with behavioural markers are likely to be underpowered and occur only with very high replay rates or much higher sample size. See our simulation discussion for a more detailed explanation on how correlations may be inherently biased, where fluctuations in baseline sequenceness overshadow individual scaling with behavioural markers.”

      Furthermore, we have added the following paragraph to the discussion to highlight this point and refer to a power analysis we have now added to the supplement (see next answer):

      “Sample sizes in previous TDLM literature usually range between 20 to 40 participants. A bootstrap power analysis shows that even at those sample sizes, power would remain low unless unrealistically high replay rates are assumed (Supplement Figure 11). Our bootstrap simulation shows that a correlation analysis between sequenceness and behaviour would in these cases be drastically underpowered, even under an assumption of high replay densities.”

      Finally, we have added a remark about the sample size to the limitations section, as naturally, an increase in sample size would yield higher power:

      “Finally, while initially planning for thirty participants, due to exclusion criteria, our study featured fewer participants than most previous studies using TDLM (i.e. usually 25-40, but 21 in our study). While we are confident that our simulation results hold under these sample sizes, as sample sizes of other studies show comparable power to ours, we cannot fully rule out a possibility that our null-findings are explained by a lack of power alone.”

      Relatedly, it would be very useful if one of the recommendations that come out of the simulations in this paper was a power analysis for detecting sequenceness in general, as I suspect that the small sample size impacts this as well, given that sequenceness effects reported in other work are often small with larger sample sizes. Further, I believe that the authors' simulations of basic sequenceness effects would themselves still suffer from having a small number of subjects, thereby impacting statistical power. Perhaps the authors can perform a similar sort of bootstrapping analysis as they perform for the correlation between replay and performance, but over sequenceness itself?

      We agree with the referee that this, in principle, is a great idea. However, the way that significance thresholds are calculated poses a conceptual problem for such an analysis: the significance threshold is defined as the maximum sequenceness value across all participants, all time lags and all permutations. This sequenceness value is compared against the mean of all participants, disregarding the standard deviation. This maximum threshold would not change if we bootstrapped some of our samples. Additionally, the 95th-percentile threshold would also not change significantly. To illustrate this point, we have added this analysis to the supplement, as Supplement Figure 10. However, the new sign-flip permutation test we now include allows for such a comparison, as it also takes variance between participants into account. We have included all three variants of the power analysis and the figure description now reads:

      “Supplement Figure 11 Power analysis of sequenceness significance for bootstrapped sample sizes. A) Powermap for state-permutation thresholds. However, here the bootstrap approach suffers from a conceptual problem: significance thresholds are defined by the permutation maximum and/or 95th percentile of the maxima across all sequence-permutations across participants. If we resample bootstrap-participants from our existing pool, the maximum thresholds computed will remain relatively stable across resampled participants, as it only compares against the mean and disregards the standard deviation. B) The newly presented statistical approach is significantly more sensitive at higher sample sizes. Note that even then, 80% power is only reached with a replay density higher than 50 min⁻¹ at a sample size of 60 participants. Additionally, the sign-flip permutation test assumes that the mean is at zero. As we observed a non-zero mean due to spurious oscillations, we subtracted the mean sequenceness of the baseline condition from each participant before permuting to achieve a null distribution with mean zero, as otherwise, we would have found significant replay effects in the baseline condition at increasing sample size. Nevertheless, due to the higher sensitivity, the new sign-flip test is recommended over the previous sequence-permutation-based test. Colours indicate the power from 0 to 1 for different bootstrapped sample sizes and densities. 80% power thresholds are outlined in black.”
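      The bootstrap power estimate described in this figure legend can be sketched for a single lag and a single replay density as follows. The function names, the single-lag simplification, and the default numbers of resamples and permutations are assumptions for illustration only; the baseline-mean subtraction follows the description above.

```python
import numpy as np

def sign_flip_p(x, rng, n_perm=500):
    """Two-sided sign-flip p-value for the mean of x differing from zero."""
    x = np.asarray(x, dtype=float)
    signs = rng.choice([-1.0, 1.0], size=(n_perm, x.size))
    null = np.abs(signs @ x) / x.size
    return (1 + np.sum(null >= abs(x.mean()))) / (n_perm + 1)

def bootstrap_power(replay_seq, baseline_seq, n_subjects,
                    n_boot=200, alpha=0.05, seed=0):
    """Fraction of bootstrap samples in which sequenceness is significant.

    replay_seq:   per-participant sequenceness with simulated replay inserted.
    baseline_seq: per-participant sequenceness of the baseline condition.
    n_subjects:   bootstrapped sample size to evaluate.
    """
    rng = np.random.default_rng(seed)
    replay_seq = np.asarray(replay_seq, dtype=float)
    baseline_seq = np.asarray(baseline_seq, dtype=float)

    # Subtract the mean baseline sequenceness so that the null distribution
    # is centred on zero, as the sign-flip test assumes.
    centred = replay_seq - baseline_seq.mean()

    hits = 0
    for _ in range(n_boot):
        # Resample participants with replacement from the existing pool.
        sample = rng.choice(centred, size=n_subjects, replace=True)
        hits += int(sign_flip_p(sample, rng) < alpha)
    return hits / n_boot
```

      Repeating this over a grid of replay densities and sample sizes would give a power map of the kind shown in panel B.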

      The task paradigm may introduce issues in detecting replay that are separate from TDLM. First, the localizer task involves a match/mismatch judgment and a button press during the stimulus presentation, which could add noise to classifier training separate from the semantic/visual processing of the stimulus. This localizer is similar to others that have been used in TDLM studies, but notably in other studies (e.g., Liu, Mattar et al., 2021), the stimulus is presented prior to the match/mismatch judgment. A discussion of variations in different localizers and what seems to work best for decoding would be useful to include in the recommendations section of the discussion.

      We agree and thank the referee for raising this issue. Note, we acknowledge we forgot to mention that these trials were excluded from classifier training. Our rationale of presenting the oddball during stimulus presentation, and not thereafter, was an assumption that by first presenting the audio and then the visual cue we would create more generalized representations that would be less modality-dependent. However, importantly, we excluded all trials that were oddballs from localizer training. Therefore we assume that this particular design choice will not greatly affect the decoder training. If some motor-preparation activity is present during the stimulus presentation, then it should be present equally across all trials and hence be ignored by the classifier as we balanced the transitions between images. We now added this information to the main text:

      “In each trial, a word describing the stimulus was played auditorily, after which the corresponding stimulus was shown. In ~11% of cases, there was a mismatch between word and image (oddball trials), and these trials were excluded from the localizer training.” Additionally in the methods section: “These oddball-trials were excluded from all further analysis and decoder training.”

      Nevertheless, we agree that the extant variety in localizer designs is underdiscussed where many assumptions of classifier training are not, as yet, fully validated. We have added a sentence highlighting different oddball paradigms to the section on the discussion of localizers and also add a summary statement with recommendations. The passage now reads:

      “Additionally, a wide variety of oddballs has been used (e.g. upside-down, scrambled, or mismatched images, cues presented visually, as words, auditorily, etc), and at this time it is unclear if these affect the representations that the classifier learns [...] In summary, we would expect a multimodal categorical localizer, and a classifier that isn’t trained on a specific timepoint, to generalize best.”

      Second, and more seriously, I believe that the task design for training participants about the expected sequences may complicate sequence decoding. Specifically, this is because two images (a "tuple") are shown together and used for prediction, which may encourage participants to develop a single bound representation of the tuple that then predicts a third image (AB -> C rather than A -> B, B -> C). This would obviously make it difficult to i) use a classifier trained on individual images to detect sequences and ii) find evidence for the intended transition matrix using TDLM. Can the authors rule out this possibility?

      We thank the reviewer for raising a possibility we have not considered! While there is some evidence that a single bound representation would have overlap with its constituents (especially before long-term consolidation) and therefore be detectable by the classifiers, we acknowledge the possibility that individual classifiers would fail to be sensitive to such a compound representation. In fact we find in the retrieval data some evidence for a combined replay of representations (where representations are replayed seemingly at the same time, see Kern 2024). We have added such a possibility to the interim discussion of Study 1 as a qualification. However, this does not change the results or interpretation of our simulation, which we consider a key message of the paper.

      The relevant segment in the discussion section now reads:

      “Additionally, given that the stimuli were presented in combined triplets, participants may have formed a singular representation of associated items and subsequently replayed these (e.g., AB→C), instead of replaying item-by-item transitions (A→B→C). Under such a scenario, a classifier trained on individual items may fail to detect these newly formed bound representations, particularly if they diverge strongly from the single-item patterns. In our previous study where we address retrieval (Kern et al., 2024) we found that states were to varying extent co-reactivated, yet classifiers trained on single items retained sensitivity to detect these combined reactivation events. Consistent with this, prior work suggests that unified representations retain overlap with their constituent item representations (Dennis et al., 2024; Liang et al., 2020), however, there’s also evidence that different brain regions are involved if representational unitization occurs (Staresina & Davachi, 2010), potentially confusing classifiers. Therefore, we cannot exclude that rest-related consolidation replays engendered unitized representations that were insufficiently captured by our single-item classifiers.”

      Participants only modestly improved (from 76% to 82% accuracy) following the rest period (which the authors refer to as a consolidation period). If the authors assume that replay leads to improved performance, then this suggests there is little reason to see much task-related replay during rest in the first place. This limitation is touched on (lines 228-229), but I think it makes the lack of replay finding here less surprising. However, note that in the supplement, it is shown that the amount of forward sequenceness is marginally related to the performance difference between the last block of training and retrieval, and this is the effect I would probably predict would be most likely to appear. Obviously, my sample size concerns still hold, and this is not a significant effect based on the null hypothesis testing framework the authors employ, but I think this set of results should at least be reported in the main text.

      We disagree that an absence or presence of replay might be inferred from an absolute memory enhancement. While consolidation can lead to absolute improvement of performance in, for example, motor memory domains, one formulation is that in declarative learning tasks replay stabilizes latent memory traces, and in such a scenario would not necessarily lead to a boosted performance. While many declarative consolidation studies report an increase of performance compared to a control condition (i.e. without a consolidation window), this does not necessarily entail an absolute performance increase, as replay might just act to protect against loss of memory traces. Therefore, the modest increase we observe does not permit inference as to the presence or absence of replay absent a proper control condition.

      We did expect to find a correlation between replay and individual behaviour. Indeed, a weak correlation between performance and sequenceness can be detected. However, as we also show, any such correlation is overshadowed by baseline fluctuations in sequenceness such that its overall validity is questionable, even under very high replay rates. We would therefore be circumspect about this correlation even if it were significant, and in the discussion we chose to refrain from putting much focus on it. Nevertheless, we do add a short statement to the corresponding figure label, discussing this precise issue. The segment now reads:

      “While we found a non-significant relation between a memory performance enhancement and post-learning forward sequenceness, we are cautious not to overinterpret these results. As shown in the section “Correlation with behaviour only present at high replay speeds”, the noted correlational measure oscillates heavily with baseline sequenceness fluctuations, and any true replay effect is likely to be overshadowed by such fluctuations.”

      I was also wondering whether the authors could clarify how the criterion over six blocks was 80% but then the performance baseline they use from the last block is 76%? Is it just that participants must reach 80% within the six blocks *at some point* during training, but that they could dip below that again later?

      We thank the reviewer for highlighting this point: The first block wherein participants reached >80% ended the learning blocks. After a maximum of six blocks the learning session was ended regardless of performance. Therefore, some participants’ learning blocks were ended after six blocks without them reaching a performance of 80%. While we described this in the Methods section, it was missing from the Results Study I section, which now contains:

      “[...] Participants then learned triplets of associated items according to a graph structure. Within the learning session, participants performed a maximum of six learning blocks, but the session was stopped if participants reached 80% memory performance (criterion learning; see Methods for details)”

      The Figure 2 description now contains

      “[...] Participants completed up to six blocks of learning trials. After reaching 80% in any block, no more learning blocks were performed (criterion learning) [...]”

      Lastly, there was a mistake in the Behavioural results section, which stated “All thirty participants, except one, [..] to criterion of 80%.” This is an error. In our preregistration, we specified that only participants who successfully learned anything at all above chance would be included. Here, we meant that only one participant failed to reach a criterion that we defined as “successful learning”. We fixed it and it now reads

      “with an accuracy above 50% (which we preregistered beforehand as an exclusion criterion for “successful learning above chance”).”

      Additionally, we have noted this for clarity in the methods section and apologise for this mistake:

      “Additionally, as successful above-chance learning was necessary for the paradigm, we ensured all remaining participants had a retrieval performance of at least 50% (one participant had to be excluded, but was already excluded due to low decoding performance).”

      Because most of the conclusions come from the simulation study, there are a few decisions about the simulations that I would like the authors to expand upon before I can fully support their interpretations. First, the authors use a state-to-state lag of 80ms and do not appear to vary this throughout the simulations - can the authors provide context for this choice? Does varying this lag matter at all for the results (i.e., does the noise structure of the data interact with this lag in any way?)

      This was a deliberate choice but we acknowledge the reasoning behind this was not detailed in our initial submission. We chose a lag of 80 milliseconds for several reasons: first, it is distant from the 9-11 Hz alpha oscillations we observed in our participants and does not share a harmonic with the alpha rhythm; second, we wanted to get a clear picture of the effect of simulated replay that is as isolated as possible from spurious sequenceness confounders present in the baseline condition. Thus, we chose a lag at which the sequenceness score was close to zero in the baseline condition; third, in this revision, we subtracted the mean sequenceness value of the baseline such that any simulation effects would start, on average, at zero sequenceness. In this way, we could attribute any increase in sequenceness to the experimentally inserted replay, independent of spurious oscillations. Finally (but less importantly), as we observed that the correlation of sequenceness with behaviour fluctuated strongly, for the reason detailed above, we chose a lag at which this correlation was as close as possible to zero. If we had not chosen a lag that adhered to these conditions, we would have been at risk of measuring simulated replay plus spurious sequenceness confounders.

      We have added a sentence to the main text detailing this justification:

      “We chose this timepoint (80 msec state to state lag) as its sequenceness value was close to zero in the baseline condition as well as being distant to the observed alpha rhythms of the participants (which varied between ~9-11 Hz). Additionally, we subtracted the mean sequenceness value of the baseline at 80 milliseconds lag such that any simulation effects would, on average, start at zero sequenceness.”

      Additionally, we now add a more detailed explanation to the methods section.

      “This time lag (80 msec) was chosen in order to isolate precisely an effect of the experimentally inserted sequenceness. Thus, we chose a lag at which the mean baseline sequenceness was close to zero and where the correlation with behaviour was low. Additionally, we subtracted the mean sequenceness value (at 80 milliseconds) at baseline from the specific lag recorded for each participant, such that simulation effects would be initialized at zero sequenceness on average enabling any effects to be attributed purely to inserted replay. Additionally, we excluded time lags too close to the alpha rhythms of participants (which varied between ~9-11 Hz) or lags which would have a harmonic with the rhythm.”
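      The lag-selection logic described above can be illustrated with a short sketch. How "too close to alpha" and "sharing a harmonic" are operationalised here (integer multiples and integer fractions of the alpha period, with a tolerance in milliseconds) is an assumption made for the example, as are the function names and default parameters; the correlation-with-behaviour criterion is left out for brevity.

```python
import numpy as np

def near_alpha(lag_ms, alpha_hz=(9.0, 11.0), max_harmonic=3, tol_ms=5.0):
    """True if a lag falls near the alpha period band or one of its
    integer multiples or fractions (assumed definition of 'harmonic')."""
    lo = 1000.0 / alpha_hz[1]   # shortest alpha period in ms (about 91 ms)
    hi = 1000.0 / alpha_hz[0]   # longest alpha period in ms (about 111 ms)
    for k in range(1, max_harmonic + 1):
        for start, stop in ((lo * k, hi * k), (lo / k, hi / k)):
            if start - tol_ms <= lag_ms <= stop + tol_ms:
                return True
    return False

def pick_lag(lags_ms, baseline_seq, **alpha_kwargs):
    """Pick the allowed lag whose mean baseline sequenceness is closest to zero.

    lags_ms:      (n_lags,) candidate lags in milliseconds.
    baseline_seq: (n_participants, n_lags) baseline sequenceness.
    Returns the chosen lag and the baseline mean to subtract at that lag.
    """
    lags_ms = np.asarray(lags_ms, dtype=float)
    mean_base = np.asarray(baseline_seq, dtype=float).mean(axis=0)
    allowed = np.array([not near_alpha(lag, **alpha_kwargs) for lag in lags_ms])
    score = np.where(allowed, np.abs(mean_base), np.inf)
    idx = int(np.argmin(score))
    return lags_ms[idx], mean_base[idx]
```

      Under these assumed definitions a lag of 80 ms is not flagged, whereas lags of 50 ms or 100 ms are; the returned baseline mean is the value that would be subtracted so that simulated effects start at zero on average.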

      Second, it seems that the approach to scaling simulated replays with performance is rather coarse. I think a more sensitive measure would be to scale sequence replays based on the participants' responses to *that* specific sequence rather than altering the frequency of all replays by overall memory performance. I think this would help to deliver on the authors' goal of simulating an "increase of replay for less stable memories" (line 246).

      The referee makes an excellent point and our simulations could be rendered more realistic by inserting the actual tuples that participants answered correctly. If we understand the point correctly, there are two different ways replay might be impacted by performance: First, we can conjecture that there is greater replay if memory performance is not saturated. Second, replay only occurs for content that has actually been encoded!

      We chose to simulate the entire sequence being replayed for each participant for the following reasons. TDLM is implemented such that the amount of replay alone is relevant, and the actual transitions do not affect the results beyond noise. Under the assumption that class-specific classifiers perform equally well, simulating A->B, B->C or simulating A->B, A->B yields equivalent results. However, results can differ if this assumption is violated. By drawing from the entire space of classes we insert, we minimize the risk of some classifiers being worse than others for some participants. For example, if we simulated only A->B for some participant instead of the whole sequence, and by chance classifier A performs suboptimally, we would then introduce additional unwanted variance into our results.

      Secondly, from our reading of the literature we infer that replay is increased generally (i.e. density of learning-specific replay is increased) for less stable memories. However, we do not have indicators of memory strength, but only a binary “remembered or not”. As TDLM is invariant to the actual transitions being replayed and only indexes the number of transitions, we chose to ignore which transitions we insert and only scaled the amount of replay.

      We have added an analysis to the Appendix that discusses this specific aspect of our study where we show that results are equivalent if we simulate replay of “A->B B->C C->D” or only “A->B A->B A->B A->B”. As we do not know how replay density interacts with memory trace stability, we opted to leave the current simulation as is. The corresponding paragraph and figure description now read:

      “From the literature we know that replay is increased after learning and that less stable memories are replayed more often. We simulated this effect by scaling our replay density inversely with performance. However, for simplicity, in our simulation, we inserted transitions sampled from all valid transitions given by the graph structure. This meant that some participants would have transitions inserted that they didn’t actually remember. To show that this would not change results, we simulated two scenarios: In the full sequence scenario, all valid graph transitions are inserted (i.e. all participants’ replay is sampled from 'A->B, B->C, C->D, D->E, E->F, F->G, G->E, E->H, H->I, I->B, B->J, J->A'). In the second scenario (memorized transitions) we only replayed transitions that the participant actually retrieved correctly during the post-resting state testing sessions (i.e. a participant’s replay would have been sampled from ‘A->B, B->C, G->E, E->H, H->I’, if those were the ones they remembered). In both scenarios, the number of events is kept constant. The results are equivalent as can be seen in Appendix A Figure 3. NB this only holds under the assumption that classifiers are equally good at decoding each class.”

      […]

      “TDLM is insensitive towards which transitions are replayed and only sensitive to how many transitions are detected in total. Here we simulate transitions either sampled from the full graph (light orange/green) or participant-specific transitions of trials that participants correctly remembered (dark orange/green). Shaded areas denote the standard error across participants.”

      On the other hand, I was also wondering whether it is actually necessary to use the real memory performance for each participant in these simulations - couldn't similar goals (with a better/more full sampling of the space of performance) be achieved with simulated memory performance as well, taking only the MEG data from the participant?

      The decision to use real memory performance is indeed arbitrary. We could have also used randomly sampled values. However, as we wanted to understand our null results better, we opted to use real performance to adhere as closely as possible to the findings we previously reported. Using uniformly sampled memory performance would be less explanatory with respect to our actual results from the resting state data reported in the first study of the manuscript (Study I).

      Nevertheless, our current implementation already presents an approach that samples the entire performance range for the sub-analysis focusing on the correlation with behaviour. Here, in the section on “best-case”-scenario, we implement this such that it spans factors from 1 to 0 (i.e., a participant with 100% performance gets a replay scale factor of 0 and hence no replay simulated, and the worst performing participant with 50% performance has a replay rate multiplied by 1). We scale the amount of replay with this factor. As a correlation is invariant to linear scaling, statistically this is equivalent to stretching the performance distribution from 0 to 100%. We have added a sentence to the methods to provide further focus on this point:

      “To assess how performance might affect replay in our specific dataset, we chose to use the original participants’ performance values instead of uniformly sampling the performance space (which ranged from 50 to 100%). However, for the correlation analysis, we additionally added a “best-case” scenario, in which we scale replay from 0 to 1, an approach that is statistically equivalent to scaling values to the full space of possible performance (0 to 100%) (see Results Study II: Simulation).”
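
      The invariance argument can be checked numerically. The sketch below uses made-up performance and sequenceness values; the helper names are hypothetical. It shows that Pearson's r is unchanged when performance (50 to 100%) is stretched linearly to the full 0 to 1 range, and only flips sign under the inverse replay scale factor.

```python
import math

def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def replay_scale(performance, worst=0.5, best=1.0):
    """Scale factor: 1 at the worst performance, 0 at perfect performance."""
    return (best - performance) / (best - worst)

# Hypothetical values, for illustration only.
performance = [0.50, 0.62, 0.75, 0.80, 0.91, 1.00]
sequenceness = [0.90, 0.70, 0.65, 0.30, 0.20, 0.05]

stretched = [(p - 0.5) / 0.5 for p in performance]  # 50..100% mapped to 0..1
r_raw = pearson(performance, sequenceness)
r_stretched = pearson(stretched, sequenceness)
r_scale = pearson([replay_scale(p) for p in performance], sequenceness)
```

      Here `r_raw` and `r_stretched` agree up to floating-point error, and `r_scale` equals `-r_raw`, because every mapping involved is linear.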

      Finally, Figure 7D shows that 70ms was used on the y-axis. Why was this the case, or is this a typo?

      Thanks, this is indeed a typo, we fixed it.

      Because this is a re-analysis of a previous dataset combined with a new simulation study on that data aimed at making recommendations about how to best employ TDLM, I think the usefulness of the paper to the field could be improved in a few places. Specifically, in the discussion/recommendation section, the authors state that "yet unknown confounders" (line 295) lead to non-random fluctuations in the simulated correlations between replay detection and performance at different time lags. Because it is a particularly strong claim that there is the potential to detect sequenceness in the baseline condition where there are no ground-truth sequences, the manuscript could benefit from a more thorough exploration of the cause(s) of this bias in addition to the speculation provided in the current version.

      We are currently working on a theoretical basis to explain these spurious sequenceness confounders in the baseline condition. Indeed, in our preliminary work, in certain contexts we can induce significant sequenceness in the absence of any replay signal during baseline. However, this work is at an early stage and we still have some conceptual problems to solve before we are confident enough with these data. We believe at present it would be premature to add these data to the current manuscript. Nevertheless, we now mention these spurious sequenceness confounders to raise awareness for the field and also add greater context to the discussion, highlighting one of the issues that we think is of importance:

      “[…] For example, if two classifiers’ probabilities oscillate at 10 Hz but at a different phase, a spurious time lag can be found reflecting this phase shift. We speculate that more complex interactions between classifiers oscillating at different phases are also conceivable.”
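
      The phase-shift example can be reproduced with a toy calculation. This is a sketch under stated assumptions (noise-free sinusoids standing in for classifier probabilities, a 100 Hz sampling rate, a 30 ms shift, and a simple lagged product rather than the TDLM GLM); none of these values are taken from the manuscript.

```python
import math

def lagged_coupling(x, y, lag):
    """Mean product of x[t] and y[t + lag] over the overlapping samples."""
    n = len(x) - lag
    return sum(x[t] * y[t + lag] for t in range(n)) / n

sfreq = 100           # samples per second (assumed)
freq = 10.0           # both "classifiers" oscillate at 10 Hz
shift_s = 0.03        # assumed phase shift, expressed as 30 ms
phase = 2 * math.pi * freq * shift_s
n_samples = 1000      # 10 s of data

a = [math.sin(2 * math.pi * freq * i / sfreq) for i in range(n_samples)]
b = [math.sin(2 * math.pi * freq * i / sfreq - phase) for i in range(n_samples)]

# Scan lags of 0..90 ms, i.e. less than one 100 ms oscillation period.
coupling = {lag: lagged_coupling(a, b, lag) for lag in range(10)}
peak_lag = max(coupling, key=coupling.get)
peak_lag_ms = peak_lag * 1000 // sfreq
```

      The coupling peaks at the lag matching the phase shift even though no sequential event was inserted, which is the kind of spurious time lag the quoted passage describes.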

      In addition, to really provide that a realistic simulation is necessary (one of the primary conclusions of the paper), it would be useful to provide a comparison to a fully synthetic simulation performed on this exact task and transition structure (in addition to the recreation of the original simulation code from the TDLM methods paper).

      Thank you for this suggestion! We have now added a synthetic simulation, trying to keep as close as possible to the original simulation code in Liu et al. (2021), while also incorporating our current means of simulating the data (i.e. scaling by performance). We think this synthetic simulation greatly improves the paper and gives weight to our suggestion about the superiority of a hybrid approach. Additionally, it prompted us to look closer at patterns that are inserted in the synthetic simulation and perform a comparative analysis. We have now added the simulation to the main text, together with a methodological explanation of how we simulated the data in the methods section. We also added a discussion on the results and why we think a hybrid approach is currently superior to a synthetic approach. The whole new section is too long to paste here – it is found after the main simulation section in the manuscript. We have also added another sentence to the abstract referring to this new inclusion.

      Finally, I think the authors could do further work to determine whether some of their recommendations for improving the sensitivity of TDLM pan out in the current data - for example, they could report focusing not just on the peak decoding timepoint but incorporating other moments into classifier training.

      While we do understand the desire to test further refinements to TDLM on the data directly, we intentionally do not include such analyses in the current paper. Our experience also informs us that there is an enormous branching factor of parameters when applying TDLM, with implications for the significance of results in one or the other direction. Moreover, there are currently only limited ways to know whether parameter changes actually improve the sensitivity to replay or instead exacerbate potential underlying confounders that induce spurious sequenceness (e.g., we can get significant replay in the control condition with some parameter changes). To exclude such false-positive findings, we opt for a relatively strict adherence to previously published approaches. Thus, in the current paper, we limit ourselves to assessing the reliability and robustness of previous approaches.

      Furthermore, while training on a later timepoint might increase sensitivity for a classifier when transferring between different modalities (e.g. visual to memory representation), this approach does not transfer well in our simulations, as the inserted patterns are from the same modality. We consider that other, more bespoke studies are better suited to improving classifier training. NB also see our recently started Kaggle challenge to tackle this problem: https://www.kaggle.com/competitions/the-imagine-decoding-challenge

      However, we have added a note about this dilemma to the improvement section. The section now includes:

      “Nevertheless, as the considerable branching factor poses a threat of increased false-positive findings we opt to focus the current simulations on previously published pipelines and parameters. Future studies should systematically evaluate parameter choices on TDLM under different conditions, something that is beyond the remit of the current study.”

      Lastly, I would like the authors to address a point that was raised in a separate public forum by an author of the TDLM method, which is that when replays "happen during rest, they are not uniform or close." Because the simulations in this work assume regularly occurring replay events, I agree that this is an important limitation that should be incorporated into alternative simulations to ensure the lack of findings is not because of this assumption.

      The temporal distribution of replay throughout the resting state should not matter, as TDLM is invariant with respect to how replay events are distributed within the analysis window. Specifically, it does not matter if replay events occur in bursts or are uniformly distributed. Only the number of transitions is relevant; where they occur, or whether they are close to each other, does not affect the numerical results (as long as the refractory window is kept; distances that are too short will lead to interactions between events and reduce sensitivity). To emphasize this point, we have added another simulation which is shown in Appendix A.1 and Appendix A Figure 1. We have referenced it in the text and added the following paragraph in the Methods section:

      Additionally, the timepoints of inserting replay within the resting state are sampled from a uniform distribution. Even though TDLM tracks reactivation events over time, at a macro-scale the algorithm is invariant to the temporal distribution. At each time step, the GLM regresses onto a future time step up to the maximum time lag of interest, yielding a predictor per lag. However, these predictors within the GLM are independently assessed, and hence, TDLM is, outside of the time lag window, relatively invariant to the temporal distribution of replay. To demonstrate our claim, we simulated uniform replay vs “bursty” replay that only occurs in some parts of the resting state, both yield equivalent sequenceness results (see Appendix A.1).
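
      The invariance claim can be illustrated with a deliberately simplified, noise-free toy. It uses a single-predictor lagged regression (slope of B at t + lag on A at t) in place of the full TDLM GLM, and all numbers (sampling rate, event count, spacing) are assumptions chosen for illustration.

```python
def insert_events(n_samples, onsets, lag):
    """Binary reactivation traces: A fires at each onset, B fires lag samples later."""
    a = [0.0] * n_samples
    b = [0.0] * n_samples
    for t in onsets:
        a[t] = 1.0
        b[t + lag] = 1.0
    return a, b

def lagged_beta(a, b, lag):
    """OLS slope (no intercept) of b[t + lag] regressed on a[t]."""
    n = len(a) - lag
    num = sum(a[t] * b[t + lag] for t in range(n))
    den = sum(a[t] ** 2 for t in range(n))
    return num / den

n_samples = 24000   # 4 minutes at an assumed 100 Hz
lag = 8             # 80 ms state-to-state lag
n_events = 100

uniform_onsets = [i * (n_samples // n_events) for i in range(n_events)]  # spread over the whole recording
bursty_onsets = [i * 20 for i in range(n_events)]                        # packed into the first 20 s, 200 ms apart

beta_uniform = lagged_beta(*insert_events(n_samples, uniform_onsets, lag), lag)
beta_bursty = lagged_beta(*insert_events(n_samples, bursty_onsets, lag), lag)
beta_off_lag = lagged_beta(*insert_events(n_samples, bursty_onsets, lag), 12)
```

      Both placements yield the same slope at the inserted lag and none at a neighbouring lag, because the estimate depends on how many A-to-B transitions occur, not on where in the recording they fall. Real data add noise and autocorrelation, which this toy ignores.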

      Reviewer #3 (Public review):

      (1) I am still left wondering why other studies were able to detect replay using this method. My takeaway from this paper is that large time windows lead to high significance thresholds/required replay density, making it extremely challenging to detect replay at physiological levels during resting periods. While it is true that some previous studies applying TDLM used smaller time windows (e.g., Kern's previous paper detected replay in 1500ms windows), others, including Liu et al. (2019), successfully detected replay during a 5-minute resting period. Why do the authors believe others have nevertheless been able to detect replay during multi-minute time windows?

      (Due to similarity, we combined our responses with the first question of Reviewer 1)

      We are reluctant to make sweeping judgments in relation to previous literature, as we wanted to prioritize advancing methodology instead. The previous TDLM literature uses a diverse set of tasks and cognitive processes. As we state ourselves, it is possible that replay bursts in short time windows are well detectable by TDLM. We were intentionally cautious about directly critiquing previous studies without detailed re-analysis of their work and wanted to leave such a conclusion up to the reader. However, we realize that such a “thought-starter” might be warranted and improve the paper. Therefore, we have added the following paragraph to the discussion about “improving TDLMs sensitivity”:

      “Finally, what do our simulations imply for the broader MEG replay literature? Our implementation successfully detects replay when boundary conditions are met, as shown in the simulation. But sensitivity depends critically on high fidelity between the analysis window and the amount of replay events. A systematic evaluation of these conditions across prior studies is beyond the scope of this paper, so we do not want to adjudicate earlier findings and leave this assessment up to the reader. Instead, we delineate the boundary conditions and urge future work to conduct power analyses where possible and include simulations that approximate realistic experimental conditions.”

      For example, some studies using TDLM report evidence of sequenceness as a contrast between evidence of forwards (f) versus backwards (b) sequenceness; sequenceness was defined as ZfΔt - ZbΔt (where Z refers to the sequence alignment coefficient for a transition matrix at a specific time lag). This use case is not discussed in the present paper, despite its prevalence in the literature. If the same logic were applied to the data in this study, would significant sequenceness have been uncovered? Whether it would or not, I believe this point is important for understanding methodological differences between this paper and others.

      This approach was first introduced as part of a TDLM-predecessor that utilized cross-correlations (Kurth-Nelson et al., 2016), where this step is a necessity to extract any sequenceness signal at all by subtracting signals that are present in both (akin to an EEG reference). However, its validity is less clear when fwd and bkw are estimated separately, as is the case in the GLM. The rationale behind subtracting here is the same as for autocorrelations: there are oscillatory confounds present in the data that introduce spurious sequenceness in both directions alike, i.e. at the same time lag, that can simply be removed by subtracting. However, this assumption only holds if the sole confounder is auto-correlations caused by a global signal that oscillates at all sensors at the same phase. In our own experience, and as mentioned in the discussion, we do not think this assumption holds. Arguably, there are more complex interactions at play that cannot be removed by such a subtraction, and subtraction can even increase false positives if confounders point in opposite directions at a specific time lag. This assumption-violation can be seen in our baseline condition, where spurious sequenceness diverges in opposite directions for some time lags (e.g. at ~90 ms where forward sequenceness is negative and backward sequenceness is positive). We reasoned that oscillatory confounds are more stable when comparing pre vs post for the same direction than when comparing forward minus backward within a session.

      Finally, we note issues introduced by the various ways that sequenceness has been analysed in previous papers: normalization of sequenceness (z-scoring across time lags or across participants or not at all), normalization of probabilities (taking raw decision scores, z-scoring, soft-max, dividing by mean, subtracting mean), taking a windowed approach and summing sequenceness scores, not to mention the various classifier choices that can be made, and all of this can be applied before or after subtracting conditions from each other. In our experience, there is insufficient regard for controlling for multiple comparisons when running all these analyses, risking selectivity in reporting.

      Nevertheless, subtracting backward from forward replay is probably as valid as post minus pre. Therefore, we have added fwd-bkw plots to the supplement and explained some of the reasoning for not reporting them in the main text in the figure label. The figure label and reference now read:

      “Finally, we report forward minus backward sequenceness and our motivation for using an across-session post-pre comparison instead of within-session forward-backward in Supplement Figure 10.”

      […]

      “Forward minus backward sequenceness within each resting state session. Previous papers often report subtraction of backward from forward sequenceness (fwd-bkw) as a means to remove oscillatory confounds that impact both sequenceness directions in synchrony. While required in early cross-correlation approaches (Kurth-Nelson et al., 2016), its validity in GLM-based frameworks depends on an assumption that confounds are global and in-phase across sensors. We observed this assumption is violated in our baseline data, where spurious sequenceness occasionally diverges in opposite directions at specific time lags (e.g., ~90 ms). In such instances, subtraction would increase the false-positive rate rather than suppress noise. In Figure 3B, we prioritized the comparison of pre-task versus post-task sequenceness within the same direction, as oscillatory confounds appeared more stable across time within a single direction, as opposed to across directions within a single session. However, we consider both approaches are valid. We now provide the fwd-bkw plots for completeness and comparison with previous literature. A) forward minus backwards sequenceness for Control (left) and Post-Learning resting-state (right). B) T-value distribution of the sign-flip permutation test for Control (left) and Post-Learning resting-state (right)”
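
      The reasoning about in-phase versus opposite-sign confounds comes down to simple arithmetic, sketched here with made-up numbers. The function name and the confound magnitude are hypothetical and not estimates from the data.

```python
def fwd_minus_bkw(true_fwd, confound_fwd, confound_bkw):
    """Observed forward minus backward sequenceness at one time lag.

    Forward carries the true effect plus its confound; backward carries only its confound.
    """
    return (true_fwd + confound_fwd) - confound_bkw

no_replay = 0.0
c = 0.02  # hypothetical confound magnitude at a given lag

# Confound identical in both directions: subtraction cancels it.
in_phase = fwd_minus_bkw(no_replay, c, c)

# Confound negative for forward and positive for backward, as described
# for the ~90 ms lag: subtraction doubles it instead of removing it.
opposite = fwd_minus_bkw(no_replay, -c, c)
```

      With no true replay, the in-phase case returns zero, whereas the opposite-sign case returns twice the confound magnitude, which is how subtraction could raise rather than lower the false-positive rate.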

      (2) Relatedly, while the authors note that smaller time windows are necessary for TDLM to succeed, a more precise description of the appropriate window size would greatly improve the utility of this paper. As it stands, the discussion feels incomplete without this information, as providing explicit guidance on optimal window sizes would help future researchers apply TDLM effectively. Under what window size range can physiological levels of replay actually be detected using TDLM? Or, is there some scaling factor that should be considered, in terms of window size and significance threshold/replay density? If the authors are unable to provide a concrete recommendation, they could add information about time windows used in previous studies (perhaps, is 1500ms as used in their previous paper a good recommendation?).

      We currently do not have an empirical estimate of which window sizes are appropriate. While we used 1500ms in our previous paper, this was solely given by the experiment design which had a 1.5s wait period before the next stimulus. Our recommendation for best guidance on this matter would be to investigate related intracranial literature for SWR rate increases under similar experimental conditions. We have added the following paragraph to the discussion:

      “At this stage we cannot offer a general recommendation for window sizes as they are likely to depend on details of the research paradigm. However, intracranial recordings can be used as proxy to estimate the duration of replay bursts, for example as reported in (Norman et al., 2019) where increased SWRs were seen up to 1500 ms after retrieval cue onset”

      (3) In their simulation, the authors define a replay event as a single transition from one item to another (example: A to B). However, in rodents, replay often traverses more than a single transition (example: A to B to C, even to D and E). Observing multistep sequences increases confidence that true replay is present. How does sequence length impact the authors' conclusions? Similarly, can the authors comment on how the length of the inserted events impacts TDLM sensitivity, if at all?

      Good point! So far, most papers do not seem to include multi-step TDLM, and in our experience rightfully so, as it is conceptually difficult to define clear significance thresholds while keeping in mind that shorter sub-sequences are contained within a longer sequence (e.g. ABC contains both AB and BC and a longer dependency of AC). This makes it difficult to define the correct way to create a null distribution for the permutation test. Therefore, we tried to stay as close as possible to previous approaches and only looked for single-step transitions. Nevertheless, we have added an analysis to the supplement comparing how TDLM behaves if we simulate A->B->C or A->B and separate B->C. It shows that TDLM is only sensitive to the number of transitions present in the data, and it does not matter if they are chained or chunked. The segment reads:

      “We intentionally designed our study to encourage replay of triplets. However, this begs the question as to whether it matters if triplets or individual chunks of a sequence are replayed at different time points? Here, we simulated two scenarios. In one, we inserted replay of single transitions alone with a refractory period, e.g. A->B and separate B->C transitions. In a second scenario, we simulate replay of chained triplets, e.g. A->B->C, with a distance of 80 milliseconds each. Importantly, we kept the number of transitions constant (i.e., A->B, … B->C and A->B->C would both have 2 transitions). This creates a context wherein a four-minute resting state would have ~100 events of A->B->C inserted and ~200 events of A->B or B->C, such that in both cases this results in the same number of single step transitions. We found both are equivalent, with TDLM agnostic to the length of sequence trains, i.e., it does not matter if replay is chunked or chained under the assumption that the number of transitions remains fixed, as can be seen in Appendix A Figure 2”

      And the reference Figure description reads:

      “TDLM is invariant to the length of sequence replay trains under an assumption that the number of target transitions (e.g. single steps) is fixed. We simulated replay either as two temporally separate A->B, B->C events (light orange/green) or as a single A->B->C event (dark orange/green), both yielding equivalent sequenceness. Shaded areas denote the standard error across participants”

      For example, regarding sequence length, is it possible that TDLM would detect multiple parts of a longer sequence independently, meaning that the high density needed to detect replay is actually not quite so dense? (example: if 20 four-step sequences (A to B to C to D to E) were sampled by TDLM such that it recorded each transition separately, that would lead to a density of 80 events/min).

      Indeed, this is an interesting proposal. We intentionally kept our simulation close to the way previous simulations were set up (i.e. Liu & Dolan et al 2021, Liu & Mattar 2021) by simulating one-step transitions and simulated them such that there is no overlap between separate events (e.g. by defining a refractory period). If the duration of replay is increased then we would also need to increase the length of the refractory period, resulting in a reduced upper limit of how much replay can occur in a 1-minute time window. This in turn would approximate roughly the same number of transitions that can be inserted into the resting state and, as detailed above, would yield the same results. Nevertheless, as we chose to use replay density and not transition density as a marker, the density would be reduced, even if the number of transitions stays the same. We have added an analysis using multi-step replay to the supplement and discuss its implications and caveats. In the main discussion we have added the following segment:

      “Similarly, in our simulation, for simplicity and to keep consistency with previous simulations, we restricted replay events to span two reactivation events. While the characteristics of replay as measured by TDLM are unknown, it is conceivable that several steps can be replayed within one replay event. We show that the vanilla version of TDLM is fundamentally sensitive to the number of single-step transitions alone, and disregards if these are replayed chained or chunked (Appendix A.2 and Appendix A Figure 2). Nevertheless, if the number of reactivation events chained within a replay event increases, TDLM’s sensitivity is increased relative to the replay density and thresholds are reached earlier (see Appendix A Figure 4). See Appendix A.4 for a simulation of multi-step replay events and our discussion of the caveats.”
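
      The distinction between replay density (events per minute) and transition density (single-step transitions per minute) can be made concrete with a short counting sketch. The helper names are hypothetical; the counts mirror the examples given in this exchange (about 100 triplets versus about 200 pairs, and 20 five-state sequences per minute).

```python
def count_transitions(events):
    """Total single-step transitions: an event visiting k states contains k - 1 steps."""
    return sum(len(states) - 1 for states in events)

def transition_density(events_per_min, n_states):
    """Single-step transitions per minute for events of n_states states each."""
    return events_per_min * (n_states - 1)

# Chunked: 200 separate two-state events (A->B or B->C).
chunked = ["AB", "BC"] * 100
# Chained: 100 three-state events (A->B->C).
chained = ["ABC"] * 100

n_chunked = count_transitions(chunked)
n_chained = count_transitions(chained)

# Reviewer's example: 20 events/min of A->B->C->D->E (five states, four steps).
reviewer_example = transition_density(20, 5)
```

      Both designs contain 200 single-step transitions although the event counts differ by a factor of two, and 20 four-step events per minute correspond to 80 transitions per minute. This is why the same transition count maps onto a lower replay density when events are chained.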

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      Please label the various significance thresholds in the legend of Figure 3.

      We have labelled all the thresholds in the figure legends.

      Reviewer #2 (Recommendations for the authors):

      I think that some of the clarity is hampered because there is a bit too much reliance on explanations from the previous paper using this task. For example, Figure 1 is not particularly useful for understanding the study in its current form; I found myself relying almost exclusively on Supplementary Figure 1 (which is from the previous paper). I'd recommend presenting some version of SF1 in the main text instead. Another example of this overreliance on the previous paper is that, as far as I can tell, the present paper never explicitly states which transitions are being tested in TDLM. In the prior work, it states "all allowable graph transitions", and so I assumed this was the same here, but the paper should stand alone without having to go back to the other study. I'd recommend that the authors revise the paper in these and other places where the previous paper is mentioned.

      Thanks for raising this point! We were uncertain ourselves how to deal with the overlap in content and did not want to bloat the paper or plagiarize ourselves too much. On the advice of the referee, we have implemented the following to improve the manuscript and reduce reliance on the previous paper:

      Supplement Figure 1 is indeed crucial to understanding the experiment. We have moved it to the methods section under Methods: Procedure

      Added more stimulus description to the Methods: Localizer section

      Included more details about the localizer and graph learning that were missing before

      We have added the note about which transitions we were looking for in the Methods section. Additionally, we have added this information to the Results section of Study 1.

      There are also a few typos I noticed:

      (1) Line 73: "during in the context of."

      (2) Line 287: " to exploring the."

      We fixed the typos.

      Reviewer #3 (Recommendations for the authors):

      (1) Why did the authors choose an 80ms state-to-state time lag for their simulation? I believe they should make the reason for this decision clear in the main text.

      Indeed, this point was also raised by the other reviewer. We have added a sentence to the main text about the rationale behind this decision:

      “We chose this timepoint (80 millisecond state-to-state lag) as its sequenceness value was close to zero in the baseline condition as well as being distant to the observed alpha rhythms of the participants (which varied between ~9-11 Hz). Additionally, we subtracted the mean sequenceness value of the baseline at 80 millisecond lag such that any simulation effects would, on average, start at zero sequenceness.”

      Additionally, we have added some further explanation to the Methods section.

      “This time lag (80 msec) was chosen in order to isolate precisely an effect of the experimentally inserted sequenceness. Thus, we chose a lag at which the mean baseline sequenceness was close to zero and where the correlation with behaviour was low. Additionally, we subtracted the mean sequenceness value (at 80 milliseconds) at baseline from the specific lag recorded for each participant, such that simulation effects would be initialized at zero sequenceness on average enabling any effects to be attributed purely to inserted replay. Additionally, we excluded time lags too close to the alpha rhythms of participants (which varied between ~9-11 Hz) or lags which would have a harmonic with the rhythm.”
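
      The lag-selection logic can be sketched as below. The exclusion rule is an assumption made for illustration: a lag is treated as conflicting if it falls within the range of alpha periods (about 91 to 111 ms for 9 to 11 Hz) or within half or double that range. The manuscript's exact criterion for "harmonic" lags is not given in this exchange. The baseline values are made up.

```python
def conflicts_with_rhythm(lag_ms, f_lo=9.0, f_hi=11.0, multiples=(0.5, 1.0, 2.0)):
    """True if lag_ms lies within the rhythm's period range or an assumed multiple of it."""
    period_lo = 1000.0 / f_hi  # shortest period in ms (about 90.9 for 11 Hz)
    period_hi = 1000.0 / f_lo  # longest period in ms (about 111.1 for 9 Hz)
    return any(m * period_lo <= lag_ms <= m * period_hi for m in multiples)

def baseline_correct(values, baseline_values):
    """Subtract the mean baseline sequenceness so effects start at zero on average."""
    mean_baseline = sum(baseline_values) / len(baseline_values)
    return [v - mean_baseline for v in values]

# Hypothetical per-participant baseline sequenceness at the 80 ms lag.
baseline_at_80ms = [0.012, -0.008, 0.003, -0.001, 0.004]
corrected = baseline_correct(baseline_at_80ms, baseline_at_80ms)

lag_80_ok = not conflicts_with_rhythm(80)   # 80 ms corresponds to 12.5 Hz, outside 9-11 Hz
lag_100_bad = conflicts_with_rhythm(100)    # 100 ms equals the period of a 10 Hz rhythm
lag_50_bad = conflicts_with_rhythm(50)      # half the period of a 10 Hz rhythm
```

      Under this assumed rule the 80 ms lag passes while 50 ms and 100 ms are excluded, and the corrected baseline values average to zero, so any later offset can be attributed to the inserted replay.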

      (2) Line 168: Can the authors define what these conservative and liberal criteria are in the text?

      We have added definitions of the criteria in the text. The text now reads:

      “[..] significance thresholds (conservative, i.e. the maximum sequenceness across all permutations and timepoints or liberal criteria, i.e. the 95th percentile of aforementioned sequenceness).”
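
      One way to read these two criteria is sketched below. It assumes that each permutation yields one sequenceness value per time lag, that the conservative threshold is the overall maximum, and that the liberal threshold is the 95th percentile (nearest-rank) of the per-permutation maxima. That last reading is an interpretation of the quoted definition, not a statement of the authors' exact implementation, and the data are synthetic.

```python
def permutation_thresholds(perm_seq, percent=95):
    """Return (conservative, liberal) thresholds from permuted sequenceness.

    perm_seq: list of permutations, each a list of sequenceness values per time lag.
    """
    per_perm_max = sorted(max(row) for row in perm_seq)
    conservative = per_perm_max[-1]                      # maximum over all permutations and lags
    rank = max(1, -(-percent * len(per_perm_max) // 100))  # ceiling division, nearest-rank percentile
    liberal = per_perm_max[rank - 1]
    return conservative, liberal

# Synthetic example: 100 permutations, 2 time lags each.
perm_seq = [[i / 100, -i / 200] for i in range(1, 101)]
conservative, liberal = permutation_thresholds(perm_seq)
```

      Taking the maximum over lags within each permutation before thresholding is what controls for testing many time lags at once; the conservative criterion additionally requires exceeding every permutation.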

      (3) Line 478: "calculate" instead of "calculated".

      (4) Figure 7 D: y-axis is labeled "70 ms" I believe it should be labeled 80 ms.

      Thanks, we fixed the two typos.

      (5) With replay defined as sequential reactivation at a compressed temporal timescale, many of the iEEG citations (lines 54-55) do not demonstrate replay (they show stimulus reinstatement or ripple activity, but not sequential replay). Replay studies in humans using intracranial methods have been mostly limited to those measuring single-unit activity, a good example being Vaz et al., 2020 (https://www.science.org/doi/10.1126/science.aba0672).

      We agree that, under a strict definition articulated by Genzel et al. that defines replay as sequential reactivation, many prior human iEEG studies are better described as stimulus reinstatement or ripple-related activity rather than true sequence replay. We have revised the text accordingly and now highlight the few intracranial microelectrode studies that demonstrate replay of firing sequences at the cellular/ensemble level in humans (Eichenlaub et al., 2020; Vaz et al., 2020), distinguishing these from macro-scale iEEG work providing indirect evidence alone.

      The revised paragraph now reads:

      “Replay has been shown using cellular recordings across a variety of mammalian model organisms (Hoffman & McNaughton, 2002; Lee & Wilson, 2002; Pavlides & Winson, 1989). Replay studies in humans using intracranial recordings are few, but include work demonstrating compressed replay of firing-pattern sequences in motor cortex during rest (Eichenlaub et al., 2020) as well as single-unit replay of trial-specific cortical spiking sequences during episodic retrieval (Vaz et al., 2020). By contrast, most iEEG studies report stimulus-specific reinstatement or ripple-locked activity changes without explicit demonstration of temporally compressed sequential replay (Axmacher et al., 2008; Staresina et al., 2015). As these methods are only applied under restricted clinical circumstances, such as during pre-operative neurosurgical assessments, this limits opportunities to investigate human replay. Therefore, this gives urgency to efforts aimed at developing novel methods to investigate human replay non-invasively.”

      (6) The expectations about replay frequency are grounded in literature on hippocampal replay sequences. However, MEG captures signals from across the entire brain, and the hippocampal contribution is likely relatively weak compared to all other signals. This raises an important question: is TDLM genuinely unable to detect replay at physiological (i.e., hippocampal) levels, or is it instead detecting a different form of sequential reactivation - possibly involving cortex or other regions - that may occur more frequently? More broadly, when we have evidence of replay from TDLM, do we believe it is the same thing as replay of CA1 place cell spiking sequences, as detected in rodents? Commenting on this distinction would help further develop theories of replay and what TDLM is measuring.

      This is indeed an important point that has garnered relatively little attention. While there is some evidence of a relation to hippocampal replay in form of high-frequency power increase in the hippocampus, ultimately it is not possible to know without intracranial recordings, as signal strength from those regions is rather poor in MEG.

      We have added the following segment to the manuscript that discusses these issues:

      “However, while we are using indices of SWRs as a proxy for replay density estimation, the relationship between hippocampal replay and replay detected by TDLM remains uncertain. While current decoding approaches measure replay-like phenomena on cortical sites, previous papers have reported a power increase in hippocampal areas coinciding with replay episodes as detected by TDLM. Nevertheless, it is conceivable that cortical replay found by TDLM could occur independently of hippocampal replay and SWRs and be generated by different mechanisms. Some TDLM-studies find a replay state-to-state time lag of above 100 ms, much slower than e.g. previously reported place cell replay. Future studies should employ simultaneous intracranial and cortical surface recordings to establish the relationship between hippocampal replay and replay found by TDLM.”

    1. eLife Assessment

      This study presents an assessment of the effect of lactate dehydrogenase (LDH) inhibition on the activity of glycolysis and tricarboxylic acid cycle. The data were collected and analyzed using solid and validated methodology. This paper makes a useful contribution to the field as it considers a control analysis of LDH flux. The findings differ from other published findings likely due to the time course of the incubations used to assess metabolism. While such comparative studies were not presented in the manuscript, the manuscript should be interpreted in light of this critical distinction.

    2. Reviewer #1 (Public Review):

      Summary:

      Zeng et al. have investigated the impact of inhibiting lactate dehydrogenase (LDH) on glycolysis and the tricarboxylic acid cycle. LDH is the terminal enzyme of aerobic glycolysis or fermentation that converts pyruvate and NADH to lactate and NAD+ and is essential for the fermentation pathway as it recycles NAD+ needed by upstream glyceraldehyde-3-phosphate dehydrogenase. As the authors point out in the introduction, multiple published reports have shown that inhibition of LDH in cancer cells typically leads to a switch from fermentative ATP production to respiratory ATP production (i.e., glucose uptake and lactate secretion are decreased, and oxygen consumption is increased). The presumed logic of this metabolic rearrangement is that when glycolytic ATP production is inhibited due to LDH inhibition, the cell switches to producing more ATP using respiration. This observation is similar to the well-established Crabtree and Pasteur effects, where cells switch between fermentation and respiration due to the availability of glucose and oxygen. Unexpectedly, the authors observed that inhibition of LDH led to inhibition of respiration and not activation as previously observed. The authors perform rigorous measurements of glycolysis and TCA cycle activity, demonstrating that under their experimental conditions, respiration is indeed inhibited. Given the large body of work reporting the opposite result, it is difficult to reconcile the reasons for the discrepancy. In this reviewer's opinion, a reason for the discrepancy may be that the authors performed their measurements 6 hours after inhibiting LDH. Six hours is a very long time for assessing the direct impact of a perturbation on metabolic pathway activity, which is regulated on a timescale of seconds to minutes. 
      The observed effects are likely the result of a combination of many downstream responses that happen within 6 hours of inhibiting LDH that causes a large decrease in ATP production, inhibition of cell proliferation, and likely a range of stress responses, including gene expression changes.

      Strengths:

      The regulation of metabolic pathways is incompletely understood, and more research is needed, such as the one conducted here. The authors performed an impressive set of measurements of metabolite levels in response to inhibition of LDH using a combination of rigorous approaches.

      Weaknesses:

      Glycolysis, TCA cycle, and respiration are regulated on a timescale of seconds to minutes. The main weakness of this study is the long drug treatment time of 6 hours, which was chosen for all the experiments. In this reviewer's opinion, if the goal was to investigate the direct impact of LDH inhibition on glycolysis and the TCA cycle, most of the experiments should have been performed immediately after or within minutes of LDH inhibition. After 6 hours of inhibiting LDH and ATP production, cells undergo a whole range of responses, and most of the observed effects are likely indirect due to the many downstream effects of LDH and ATP production inhibition, such as decreased cell proliferation, decreased energy demand, activation of stress response pathways, etc.

      Comments on revisions:

      Based on the response to comments that the authors have submitted, I do not think I need to make any changes to my review, as the time course experiment that could have explained the difference between reported results and extensive prior literature has not been performed.

    3. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      Zeng et al. have investigated the impact of inhibiting lactate dehydrogenase (LDH) on glycolysis and the tricarboxylic acid cycle. LDH is the terminal enzyme of aerobic glycolysis or fermentation that converts pyruvate and NADH to lactate and NAD+ and is essential for the fermentation pathway as it recycles NAD+ needed by upstream glyceraldehyde-3-phosphate dehydrogenase. As the authors point out in the introduction, multiple published reports have shown that inhibition of LDH in cancer cells typically leads to a switch from fermentative ATP production to respiratory ATP production (i.e., glucose uptake and lactate secretion are decreased, and oxygen consumption is increased). The presumed logic of this metabolic rearrangement is that when glycolytic ATP production is inhibited due to LDH inhibition, the cell switches to producing more ATP using respiration. This observation is similar to the well-established Crabtree and Pasteur effects, where cells switch between fermentation and respiration due to the availability of glucose and oxygen. Unexpectedly, the authors observed that inhibition of LDH led to inhibition of respiration and not activation as previously observed. The authors perform rigorous measurements of glycolysis and TCA cycle activity, demonstrating that under their experimental conditions, respiration is indeed inhibited. Given the large body of work reporting the opposite result, it is difficult to reconcile the reasons for the discrepancy. In this reviewer's opinion, a reason for the discrepancy may be that the authors performed their measurements 6 hours after inhibiting LDH. Six hours is a very long time for assessing the direct impact of a perturbation on metabolic pathway activity, which is regulated on a timescale of seconds to minutes. 
      The observed effects are likely the result of a combination of many downstream responses that happen within 6 hours of inhibiting LDH that causes a large decrease in ATP production, inhibition of cell proliferation, and likely a range of stress responses, including gene expression changes.

      Strengths:

      The regulation of metabolic pathways is incompletely understood, and more research is needed, such as the one conducted here. The authors performed an impressive set of measurements of metabolite levels in response to inhibition of LDH using a combination of rigorous approaches.

      Weaknesses:

      Glycolysis, TCA cycle, and respiration are regulated on a timescale of seconds to minutes. The main weakness of this study is the long drug treatment time of 6 hours, which was chosen for all the experiments. In this reviewer's opinion, if the goal was to investigate the direct impact of LDH inhibition on glycolysis and the TCA cycle, most of the experiments should have been performed immediately after or within minutes of LDH inhibition. After 6 hours of inhibiting LDH and ATP production, cells undergo a whole range of responses, and most of the observed effects are likely indirect due to the many downstream effects of LDH and ATP production inhibition, such as decreased cell proliferation, decreased energy demand, activation of stress response pathways, etc.

      We thank the reviewer for the careful reading of our manuscript, the accurate summary of the prevailing model, and the positive assessment of the rigor of our measurements. We agree that much prior literature reports increased oxygen consumption following LDH inhibition, and we recognize that our finding (coordinated suppression of glycolysis, the TCA cycle, and OXPHOS) differs from this prevailing interpretation. We address below the reviewer’s main concern regarding the 6-hour time point and clarify the conceptual scope of our study.

      (1) Scope: steady-state metabolic regulation versus immediate transient effects

      The reviewer raises an important point that many metabolic perturbations can trigger rapid, transient responses within seconds to minutes, whereas our measurements were performed after sustained LDH inhibition. We agree that very early time points would be required if the primary goal were to isolate the most immediate, proximal consequence of LDH inhibition before downstream propagation. However, the objective of our study is different: we aim to characterize the metabolic steady state re-established after sustained inhibition of LDH activity, because this adapted steady state is more relevant for understanding long-term metabolic consequences and therapeutic outcomes of LDH inhibition in cancer cells.

      (2) Genetic LDHA/LDHB knockout: comparison of two steady states

      A related point applies to the LDHA/LDHB knockout models. We fully agree that the knockout process necessarily involves a temporal perturbation during cell line generation and adaptation. Nevertheless, the experimental comparison in our study is explicitly between two steady states: the baseline steady state of control cells and the steady state achieved after stable genetic disruption of LDHA or LDHB. The observation that LDHA or LDHB knockout alone had minimal effects on glycolysis and respiration indicates that partial reduction of LDH activity can be compensated in a steady-state manner, consistent with the exceptionally high catalytic capacity of LDH in cancer cells relative to upstream rate-limiting enzymes.

      (3) LDH-activity-dependent quantitative relationships support stable metabolic states

      Importantly, our conclusions do not rely on a single inhibitor condition at a single time point. Rather, we established quantitative steady-state relationships between residual LDH activity and pathway behavior across a wide range of LDH inhibition. These LDH-activity-dependent data strongly support that the system resides in stable metabolic states at different degrees of LDH activity, rather than reflecting non-specific collapse due to prolonged stress.

      Specifically, we observed that when LDH activity was reduced from 100% to approximately 9% (e.g., by genetic perturbation and partial pharmacologic inhibition), glucose consumption and lactate production remained essentially unchanged, indicating maintenance of a steady-state glycolytic flux despite substantial LDH inhibition. Only when LDH activity was further reduced below this threshold did glycolytic flux decrease in a graded manner, consistent with a nonlinear control structure (Figure 8A and B).

      Likewise, the isotope tracing results showed distinct LDH-activity-dependent transitions in TCA cycle labeling patterns. Over the range in which LDH activity decreased from 100% to ~9%, the [<sup>13</sup>C<sub>6</sub>]glucose-derived labeling pattern of citrate remained largely unchanged, whereas deeper inhibition led to a decrease in m2 citrate with a compensatory rise in higher-order citrate isotopologues, consistent with altered flux entry versus cycling/retention in the TCA cycle (Figure 8C). Similarly, [<sup>13</sup>C<sub>5</sub>]glutamine tracing revealed that deeper LDH inhibition reduced the direct m5 contribution, accompanied by corresponding shifts in other isotopologues (Figure 8D). These graded, quantitative transitions—rather than an abrupt global failure—support the interpretation of distinct metabolic steady states across LDH activity levels, linking LDH inhibition to changes in both glycolysis and mitochondrial metabolism.
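      The threshold behavior of glycolytic flux described in point (3) can be illustrated with a toy model and a finite-difference flux control coefficient, (dJ/J)/(dE/E). This is only a sketch: the 9% threshold is the approximate value quoted above, but the linear scaling below the threshold is an assumption (the response says only that flux decreased in a graded manner), and the function names are hypothetical.

```python
def relative_flux(ldh_activity: float, threshold: float = 0.09) -> float:
    """Toy model: flux is capped by upstream capacity above the threshold and
    scales linearly with LDH activity below it (an illustrative assumption)."""
    if not 0.0 <= ldh_activity <= 1.0:
        raise ValueError("ldh_activity must be a fraction between 0 and 1")
    return min(1.0, ldh_activity / threshold)


def control_coefficient(ldh_activity: float, step: float = 1e-4) -> float:
    """Finite-difference estimate of (dJ/J) / (dE/E) at a given LDH activity."""
    lo = ldh_activity - step
    hi = ldh_activity + step
    d_flux = relative_flux(hi) - relative_flux(lo)
    return (d_flux / relative_flux(ldh_activity)) / ((hi - lo) / ldh_activity)


# Residual LDH activity as a fraction of control (hypothetical sample points).
for activity in (0.5, 0.2, 0.05, 0.02):
    print(activity, round(relative_flux(activity), 3),
          round(control_coefficient(activity), 3))
```

      In this toy model LDH exerts no control over flux while activity stays above the threshold and full control below it; real titration data would be expected to show a smoother transition.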

      (4) Reconciling discrepancies with prior studies

      We agree that multiple prior studies have reported increased oxygen consumption or enhanced oxidative metabolism following LDH inhibition in cancer cells. However, we note that this prevailing notion often persists because LDH inhibition is frequently discussed by analogy to the classical Pasteur and Crabtree effects, in which cells toggle between fermentation and respiration depending on oxygen and glucose availability. We believe this analogy can be misleading.

      In the Pasteur effect, the metabolic shift is primarily driven by oxygen limitation, i.e., restriction of the terminal electron acceptor for the mitochondrial electron transport chain, which enforces reliance on fermentation. In the Crabtree effect, high glucose availability suppresses respiration through regulatory mechanisms while glycolysis is strongly activated. Both phenomena are fundamentally controlled by oxygen availability and respiratory capacity, rather than by inhibition of a specific cytosolic enzyme.

      By contrast, LDH inhibition is mechanistically distinct: it directly perturbs cytosolic redox recycling by limiting NADH-to-NAD<sup>+</sup> regeneration and can therefore constrain upstream glycolytic flux (particularly at GAPDH) and reshape pathway thermodynamics. Under conditions where LDH inhibition sufficiently limits effective NAD<sup>+</sup> availability and reduces glycolytic flux into pyruvate, the downstream consequence is reduced carbon input into the TCA cycle and suppressed OXPHOS—consistent with our experimental measurements. We therefore suggest that divergent outcomes reported across studies likely reflect differences in residual LDH activity, cell-type–specific metabolic wiring, and the extent to which glycolytic flux remains sustained versus becoming redox-limited upstream, rather than a universal Pasteur/Crabtree-like “switch” from fermentation to respiration. Accordingly, interpreting LDH inhibition as a Pasteur/Crabtree-like toggle may oversimplify the biochemical consequences of disrupting cytosolic NAD<sup>+</sup> regeneration.

      We have revised the Discussion to clarify this conceptual distinction and to avoid relying on comparisons that are not mechanistically equivalent to LDH inhibition.

      Reviewer #2 (Public Review):

      Summary:

      Zeng et al. investigated the role of LDH in determining the metabolic fate of pyruvate in HeLa and 4T1 cells. To do this, three broad perturbations were applied: knockout of two LDH isoforms (LDH-A and LDH-B), titration with a non-competitive LDH inhibitor (GNE-140), and exposure to either normoxic (21% O2) or hypoxic (1% O2) conditions. They show that knockout of either LDH isoform alone, though reducing both protein level and enzyme activity, has virtually no effect on either the incorporation of a stable 13C-label from a 13C6-glucose into any glycolytic or TCA cycle intermediate, nor on the measured intracellular concentrations of any glycolytic intermediate (Figure 2). The only apparent exception to this was the NADH/NAD+ ratio, measured as the ratio of F420/F480 emitted from a fluorescent tag (SoNar).

      The addition of a chemical inhibitor, on the other hand, did lead to changes in glycolytic flux, the concentrations of glycolytic intermediates, and in the NADH/NAD+ ratio (Figure 3). Notably, this was most evident in the LDH-B-knockout, in agreement with the increased sensitivity of LDH-A to GNE-140 (Figure 2). In the LDH-B-knockout, increasing concentrations of GNE-140 increased the NADH/NAD+ ratio, reduced glucose uptake, and lactate production, and led to an accumulation of glycolytic intermediates immediately upstream of GAPDH (GA3P, DHAP, and FBP) and a decrease in the product of GAPDH (3PG). They continue to show that this effect is even stronger in cells exposed to hypoxic conditions (Figure 4). They propose that a shift to thermodynamic unfavourability, initiated by an increased NADH/NAD+ ratio inhibiting GAPDH explains the cascade, calculating ΔG values that become progressively more endergonic at increasing inhibitor concentrations.

      Then - in two separate experiments - the authors track the incorporation of 13C into the intermediates of the TCA cycle from a 13C6-glucose and a 13C5-glutamine. They use the proportion of labelled intermediates as a proxy for how much pyruvate enters the TCA cycle (Figure 5). They conclude that the inhibition of LDH decreases fermentation, but also the TCA cycle and OXPHOS flux - and hence the flux of pyruvate to all of those pathways. Finally, they characterise the production of ATP from respiratory or fermentative routes, the concentration of a number of cofactors (ATP, ADP, AMP, NAD(P)H, NAD(P)+, and GSH/GSSG), the cell count, and cell viability under four conditions: with and without the highest inhibitor concentration, and at norm- and hypoxia. From this, they conclude that the inhibition of LDH inhibits the glycolysis, the TCA cycle, and OXPHOS simultaneously (Figure 7).

      Strengths:

      The authors present an impressively detailed set of measurements under a variety of conditions. It is clear that a huge effort was made to characterise the steady-state properties (metabolite concentrations, fluxes) as well as the partitioning of pyruvate between fermentation as opposed to the TCA cycle and OXPHOS.

      A couple of intermediary conclusions are well supported, with the hypothesis underlying the next measurement clearly following. For instance, the authors refer to literature reports that LDH activity is highly redundant in cancer cells (lines 108 - 144). They prove this point convincingly in Figure 1, showing that both the A- and B-isoforms of LDH can be knocked out without any noticeable changes in specific glucose consumption or lactate production flux, or, for that matter, in the rate at which any of the pathway intermediates are produced. Pyruvate incorporation into the TCA cycle and the oxygen consumption rate are also shown to be unaffected.

      They checked the specificity of the inhibitor and found good agreement between the inhibitory capacity of GNE-140 on the two isoforms of LDH and the glycolytic flux (lines 229 - 243). The authors also provide a logical interpretation of the first couple of consequences following LDH inhibition: an increased NADH/NAD+ ratio leading to the inhibition of GAPDH, causing upstream accumulations and downstream metabolite decreases (lines 348 - 355).

      Weaknesses:

      Despite the inarguable comprehensiveness of the data set, a number of conceptual shortcomings afflict the manuscript. First and foremost, reasoning is often not pursued to a logical conclusion. For instance, the accumulation of intermediates upstream of GAPDH is proffered as an explanation for the decreased flux through glycolysis. However, in Figure 3C it is clear that there is no accumulation of the intermediates upstream of PFK. It is unclear, therefore, how this traffic jam is propagated back to a decrease in glucose uptake. A possible explanation might lie with hexokinase and the decrease in ATP (and constant ADP) demonstrated in Figure 6B, but this link is not made.

      We appreciate the reviewer's critical comment. In Figure 3C, there is no accumulation of F6P or G6P, which are upstream of PFK1. This is because the PFK1-catalyzed reaction sets a significant thermodynamic barrier. Even with treatment using 30 μM GNE-140, the ∆G<sub>PFK1</sub> (Gibbs free energy of the PFK1-catalyzed reaction) remains -9.455 kJ/mol (Figure 3D), indicating that the reaction is still far from thermodynamic equilibrium, thereby preventing the accumulation of F6P and G6P.
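      To put the quoted value in perspective, the relation ΔG = RT ln(Q/K_eq) converts a Gibbs free energy into the ratio of the mass-action ratio Q to the equilibrium constant. The sketch below applies it to the -9.455 kJ/mol reported above, assuming 37 °C (310.15 K), a temperature that is not stated in this response; the function name is illustrative.

```python
import math

R = 8.314462618  # gas constant, J/(mol*K)


def q_over_keq(delta_g_kj_per_mol: float, temp_k: float = 310.15) -> float:
    """Return Q/K_eq from dG = RT * ln(Q/K_eq).

    temp_k defaults to 37 degrees C, an assumption for cultured mammalian cells.
    """
    return math.exp(delta_g_kj_per_mol * 1000.0 / (R * temp_k))


# Value quoted in the response for PFK1 at 30 uM GNE-140 (Figure 3D).
ratio = q_over_keq(-9.455)
print(f"Q/K_eq = {ratio:.3f}")
```

      Under this temperature assumption the ratio is roughly 0.03, so the mass-action ratio sits well over an order of magnitude below equilibrium; a reaction at equilibrium would give exactly 1.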

      We agree with the reviewer that hexokinase inhibition may play a role; this requires further investigation.

      The obvious link between the NADH/NAD+ ratio and pyruvate dehydrogenase (PDH) is also never addressed, a mechanism that might explain how the pyruvate incorporation into the TCA cycle is impaired by the inhibition of LDH (the observation with which they start their discussion, lines 511 - 514).

      We agree with the reviewer’s comment. In this study, we did not explore how the inhibition of LDH affects pyruvate incorporation into the TCA cycle. As this mechanism was not investigated, we have titled the study:

      "Elucidating the Kinetic and Thermodynamic Insights into the Regulation of Glycolysis by Lactate Dehydrogenase and Its Impact on the Tricarboxylic Acid Cycle and Oxidative Phosphorylation in Cancer Cells."

      It was furthermore puzzling how the ΔG, calculated with intracellular metabolite concentrations (Figures 3 and 4) could be endergonic (positive) for PGAM at all conditions (also normoxic and without inhibitor). This would mean that under the conditions assayed, glycolysis would never flow completely forward. How any lactate or pyruvate is produced from glucose, is then unexplained.

      This issue also concerned me during the study. However, given the high reproducibility of the data, we consider the result genuine, although it requires explanation. The PGAM-catalyzed reaction is tightly linked to both upstream and downstream reactions in the glycolytic pathway. In glycolysis, three key reactions catalyzed by HK2, PFK1, and PK are highly exergonic, providing the driving force for the conversion of glucose to pyruvate. The other reactions, including the one catalyzed by PGAM, operate near thermodynamic equilibrium and primarily serve to equilibrate glycolytic intermediates rather than control the overall direction of glycolysis, as previously described by us (J Biol Chem. 2024 Aug 8;300(9):107648).

      The endergonic nature of the PGAM-catalyzed reaction does not prevent it from proceeding in the forward direction. Instead, the directionality of the pathway is dictated by the exergonic reaction of PFK1 upstream, which pushes the flux forward, and by PK downstream, which pulls the flux through the pathway. The combined effects of PFK1 and PK may account for the observed endergonic state of the PGAM reaction.

      However, if the PGAM-catalyzed reaction were isolated from the glycolytic pathway, it would tend toward equilibrium and never surpass it, as there would be no driving force to move the reaction forward.

      Finally, the interpretation of the label incorporation data is rather unconvincing. The authors observe an increasing labelled fraction of TCA cycle intermediates as a function of increasing inhibitor concentration. Strangely, they conclude that less labelled pyruvate enters the TCA cycle while simultaneously fewer labelled intermediates exit the TCA cycle pool, leading to increased labelling of this pool. The reasoning that they present for this (decreased m2 fraction as a function of GNE-140 concentration) is by no means a consistent or striking feature of their titration data and comes across as rather unconvincing. Yet they treat this anomaly as resolved in the discussion that follows.

      GNE-140 treatment increased the labeling of TCA cycle intermediates by [<sup>13</sup>C<sub>6</sub>]glucose but decreased the OXPHOS rate; we consider these conflicting results an 'anomaly' that warrants further explanation. To address this, we analyzed the labeling pattern of TCA cycle intermediates using both [<sup>13</sup>C<sub>6</sub>]glucose and [<sup>13</sup>C<sub>5</sub>]glutamine. Tracing the incorporation of glucose- and glutamine-derived carbons into the TCA cycle suggests that LDH inhibition leads to a reduced flux of glucose-derived acetyl-CoA into the TCA cycle, coupled with a decreased flux of glutamine-derived α-KG, and a reduction in the efflux of intermediates from the cycle. These results align with theoretical predictions. Under any condition, the reactions that distribute TCA cycle intermediates to other pathways must be balanced by those that replenish them. In the GNE-140 treatment group, the entry of glutamine-derived carbon into the TCA cycle was reduced, implying that glucose-derived carbon (as acetyl-CoA) entering the TCA cycle must also be reduced, or vice versa.

      This step-by-step investigation is detailed under the subheading "The Effect of LDHB KO and GNE-140 on the Contribution of Glucose Carbon to the TCA Cycle and OXPHOS" in the Results section in the manuscript.

      In the Discussion, we emphasize that caution should be exercised when interpreting isotope tracing data. In this study, treatment of cells with GNE-140 led to an increased labeling percentage of TCA cycle intermediates by [<sup>13</sup>C<sub>6</sub>]glucose (Figure 5A-E). However, this does not necessarily imply an increase in glucose carbon flux into the TCA cycle; rather, it indicates a reduction in both the flux of glucose carbon into the TCA cycle and the flux of intermediates leaving the TCA cycle. When interpreting the data, multiple factors must be considered, including the carbon-13 labeling pattern of the intermediates (m1, m2, m3, ...) (Figure 5G-K), replenishment of intermediates by glutamine (Figure 5M-V), and mitochondrial oxygen consumption rate (Figure 5W). All these factors should be taken into account to derive a proper interpretation of the data.
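      The distinction drawn here between labeling percentage and flux can be made concrete with the two summary statistics commonly derived from a mass isotopologue distribution. The sketch below computes the labeled fraction (1 - m0) and the mean enrichment (sum of i·m_i divided by the number of carbons). The citrate distributions are made-up numbers for illustration only, not data from the manuscript, and the function names are hypothetical.

```python
from typing import Sequence


def labeled_fraction(mid: Sequence[float]) -> float:
    """Fraction of the pool carrying at least one 13C (1 - m0)."""
    total = sum(mid)
    return 1.0 - mid[0] / total


def mean_enrichment(mid: Sequence[float]) -> float:
    """Average fraction of carbons labeled: sum(i * m_i) / n carbons."""
    total = sum(mid)
    n_carbons = len(mid) - 1
    return sum(i * m for i, m in enumerate(mid)) / (n_carbons * total)


# Hypothetical citrate distributions (m0..m6); NOT values from the study.
control = [0.50, 0.05, 0.30, 0.05, 0.06, 0.03, 0.01]
treated = [0.40, 0.05, 0.22, 0.08, 0.13, 0.08, 0.04]

for name, mid in (("control", control), ("treated", treated)):
    print(name, round(labeled_fraction(mid), 3), round(mean_enrichment(mid), 3))
```

      In this toy example the labeled fraction rises while m2 falls, which is the pattern the response describes. Neither statistic is a flux, which is why the oxygen consumption and glutamine tracing data are needed alongside it.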

      Reviewer #3 (Public Review):

      Hu et al. in their manuscript attempt to interrogate the interplay between glycolysis, TCA activity, and OXPHOS using LDHA/B knockouts as well as LDH-specific inhibitors. Before I discuss the specifics, I have a few issues with the overall manuscript. First of all, based on numerous previous studies it is well established that glycolysis inhibition or forcing pyruvate into the TCA cycle (studies with PDK inhibitors) leads to upregulation of TCA cycle activity and OXPHOS, activation of glutaminolysis, etc. (in this work the authors claim that lowered glycolysis leads to lower levels of TCA activity/OXPHOS). The authors in the current work completely ignore recent studies that suggest that lactate itself is an important signaling metabolite that can modulate metabolism (actual mechanistic insights were recently presented by at least two groups (Thompson, Chouchani labs)). In addition, extensive effort was dedicated to understanding the crosstalk between glycolysis/TCA cycle/OXPHOS using metabolic models (Titov, Rabinowitz labs). I have several comments on how experiments were performed. In the Methods section, it is stated that both HeLa and 4T1 cells were grown in RPMI-1640 medium with regular serum - but under these conditions, pyruvate is certainly present in the medium - this can easily complicate/invalidate some findings presented in this manuscript. In the LDH enzymatic assays with cell homogenates, as described, controls were not explained or presented (a lot of enzymes in the homogenate can react with NADH!). One of the major issues I have is that glycolytic intermediates were measured in multiple enzyme-coupled assays. Although one might think it is a good approach to have quantitative numbers for each metabolite, the way it was done is that cell homogenates (potentially still with traces of activity of multiple glycolytic enzymes) were incubated with various combinations of the SAME enzymes and substrates they were supposed to measure as a part of the enzyme-based cycling reaction. I would prefer to see a comparison between numbers obtained in enzyme-based assays with GC-MS/LC-MS experiments (using calibration curves for respective metabolites, of course). Correct measurements of these metabolites are crucial especially when thermodynamic parameters for respective reactions are calculated. Concentrations in multiple graphs (Figure 1g etc.) are in "mM"; I do not think that this is correct.

      We thank the reviewer for these comments. Below we clarify the conceptual framework, the quantitative methodology, and the experimental basis supporting our conclusions.

      (1) “It is well established that glycolysis inhibition or forcing pyruvate into the TCA cycle… leads to upregulation of TCA/OXPHOS… (authors claim lowered glycolysis leads to lower TCA/OXPHOS)”

      This framing is not accurate in the context of our study. PDK inhibition and LDH inhibition are fundamentally different perturbations. PDK inhibition directly promotes mitochondrial pyruvate oxidation by enabling PDH flux, whereas LDH inhibition primarily perturbs cytosolic redox balance (free NADH/NAD<sup>+</sup>) and thereby constrains upstream glycolytic reactions, particularly the GAPDH step. Therefore, the metabolic outcomes of these interventions are not expected to be identical and should not be treated as interchangeable.

      Importantly, we do not “ignore” prior studies proposing increased OXPHOS after LDH inhibition; we explicitly cite and summarize this prevailing interpretation in the Introduction. Our study was motivated precisely because this interpretation does not resolve key quantitative inconsistencies, including (i) the large mismatch between glycolytic flux and mitochondrial oxidative capacity, and (ii) the exceptionally high catalytic capacity of LDH relative to upstream rate-limiting glycolytic enzymes. These constraints raise a mechanistic question: how does LDH inhibition actually suppress glycolytic flux in intact cancer cells, and what are the consequences for TCA cycle and OXPHOS?

      Our central contribution is the identification of a biochemical mechanism supported by integrated measurements of fluxes, metabolite concentrations, redox state, and reaction thermodynamics: LDH inhibition increases free NADH/NAD<sup>+</sup>, decreases free NAD<sup>+</sup> availability, inhibits GAPDH, drives accumulation/depletion patterns in glycolytic intermediates, shifts Gibbs free energies of near-equilibrium reactions (PFK1–PGAM segment), suppresses pyruvate production, and consequently reduces carbon input into TCA cycle and OXPHOS. These analyses are not provided by most prior work and directly address the mechanistic gap.

      (2) Lactate signaling (Thompson/Chouchani) and metabolic modeling (Titov/Rabinowitz)

      These research directions are valuable, but they address questions that are different from the one investigated here. Our manuscript focuses on steady-state biochemical control of metabolic flux by LDH inhibition through redox-linked kinetics and pathway thermodynamics.

      (3) Pyruvate in RPMI

      Pyruvate in standard medium does not invalidate our conclusions. All experimental comparisons were performed under identical conditions across groups, and the major conclusions rely on orthogonal measurements including glycolytic flux (glucose consumption/lactate production), OCR profiling, and isotope tracing with [<sup>13</sup>C<sub>6</sub>]glucose and [<sup>13</sup>C<sub>5</sub>]glutamine, which directly quantify carbon entry into lactate and TCA cycle intermediates. These tracer-based results are not confounded by unlabeled extracellular pyruvate in a way that would reverse the mechanistic conclusions.

      (4) LDH activity assay in homogenates and “many enzymes can react with NADH”

      This concern is overstated. In the LDH assay, substrates are pyruvate + NADH, and the measured signal reflects NADH oxidation coupled to pyruvate reduction. In cell lysates, LDH is uniquely abundant and catalytically efficient for this reaction pair, and the inhibitor-response behavior matches the known LDHA/LDHB selectivity of GNE-140 and the cellular phenotypes. Thus, the assay is mechanistically specific in this context.

      (5) Enzyme-coupled metabolite assays and request for LC–MS validation

      The reviewer’s implication that enzyme-coupled assays are intrinsically unreliable is incorrect. Enzymatic cycling assays are a widely used quantitative approach when performed with proper specificity and calibration, and they are particularly useful for labile glycolytic intermediates that are challenging to quantify reproducibly by MS without specialized quenching, derivatization, and isotope dilution standards.

      We agree that MS-based quantification is valuable, and we have developed LC–MS methods for selected metabolites. However, absolute quantification of these intermediates remains technically difficult because of the inherent limitations of the method and, in our hands, did not provide uniformly robust performance for all intermediates required for thermodynamic analysis.

      (6) Units (“mM”)

      The metabolite concentration units are correct.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      If the goal is to investigate the direct impact of LDH inhibition, then in my opinion, most of these experiments need to be repeated at a very early time point immediately after or a few minutes after LDH inhibition. I understand that this is a tremendous amount of work that the authors might not want to pursue. I do want to highlight that the quality of the experiments performed in this work is impressive. I hope the authors continue investigating this subject and look forward to reading their future manuscripts on this topic.

      We thank the reviewer for this thoughtful and constructive comment and for the positive assessment of the experimental quality of our work.

      We fully agree that measurements at very early time points after LDH inhibition would be required if the goal were to isolate an immediate, proximal molecular event occurring before downstream propagation. However, the primary objective of our study is not to dissect a single instantaneous biochemical consequence of LDH inhibition, but rather to characterize the metabolic steady state that is re-established after sustained suppression of LDH activity, which we believe is more relevant for understanding the long-term metabolic and therapeutic consequences of LDH inhibition in cancer cells.

      (1) Scope: steady-state metabolic regulation versus immediate transient effects

      The reviewer raises an important point that many metabolic perturbations can trigger rapid, transient responses within seconds to minutes, whereas our measurements were performed after sustained LDH inhibition. We agree that very early time points would be required if the primary goal were to isolate the most immediate, proximal consequence of LDH inhibition before downstream propagation. However, the objective of our study is different: we aim to characterize the metabolic steady state re-established after sustained inhibition of LDH activity, because this adapted steady state is more relevant for understanding long-term metabolic consequences and therapeutic outcomes of LDH inhibition in cancer cells.

      (2) Genetic LDHA/LDHB knockout: comparison of two steady states

      A related point applies to the LDHA/LDHB knockout models. We fully agree that the knockout process necessarily involves a temporal perturbation during cell line generation and adaptation. Nevertheless, the experimental comparison in our study is explicitly between two steady states: the baseline steady state of control cells and the steady state achieved after stable genetic disruption of LDHA or LDHB. The observation that LDHA or LDHB knockout alone had minimal effects on glycolysis and respiration indicates that partial reduction of LDH activity can be compensated in a steady-state manner, consistent with the exceptionally high catalytic capacity of LDH in cancer cells relative to upstream rate-limiting enzymes.

      (3) LDH-activity-dependent quantitative relationships support stable metabolic states

      Importantly, our conclusions do not rely on a single inhibitor condition at a single time point. Rather, we established quantitative steady-state relationships between residual LDH activity and pathway behavior across a wide range of LDH inhibition. These LDH-activity-dependent data strongly support that the system resides in stable metabolic states at different degrees of LDH activity, rather than reflecting non-specific collapse due to prolonged stress.

      Specifically, we observed that when LDH activity was reduced from 100% to approximately 9% (e.g., by genetic perturbation and partial pharmacologic inhibition), glucose consumption and lactate production remained essentially unchanged, indicating maintenance of a steady-state glycolytic flux despite substantial LDH inhibition. Only when LDH activity was further reduced below this threshold did glycolytic flux decrease in a graded manner, consistent with a nonlinear control structure.

      Likewise, the isotope tracing results showed distinct LDH-activity-dependent transitions in TCA cycle labeling patterns. Over the range in which LDH activity decreased from 100% to ~9%, the [<sup>13</sup>C<sub>6</sub>]glucose-derived labeling pattern of citrate remained largely unchanged, whereas deeper inhibition led to a decrease in m2 citrate with a compensatory rise in higher-order citrate isotopologues, consistent with altered flux entry versus cycling/retention in the TCA cycle. Similarly, [<sup>13</sup>C<sub>5</sub>]glutamine tracing revealed that deeper LDH inhibition reduced the direct m5 contribution, accompanied by corresponding shifts in other isotopologues. These graded, quantitative transitions—rather than an abrupt global failure—support the interpretation of distinct metabolic steady states across LDH activity levels, linking LDH inhibition to changes in both glycolysis and mitochondrial metabolism.

      Reviewer #2 (Recommendations For The Authors):

      All in all, the authors would benefit from collaboration with a group more well-versed in quantitative aspects of metabolism (such as Metabolic Control Analysis) and modelling methods (such as flux analysis) to boost the interpretation and impact of their really nice data set.

      We sincerely thank the reviewer for this insightful and constructive suggestion. We fully agree that collaboration with groups specializing in quantitative metabolic analysis, such as Metabolic Control Analysis and flux modeling, would further expand the interpretative depth and broader impact of this work.

      The primary objective of the present work, however, was not to construct a global mathematical model, but to experimentally dissect the biochemical mechanism by which LDH inhibition coordinately suppresses glycolysis, the TCA cycle, and OXPHOS, integrating enzyme kinetics with thermodynamic constraints at steady state. Within this scope, we focused on experimentally demonstrable relationships between LDH activity, redox balance, GAPDH perturbation, thermodynamic shifts in near-equilibrium reactions, and emergent flux suppression.

      We fully recognize the power of MCA and related modeling approaches in formalizing control coefficients and system-level sensitivities, and we view our dataset as particularly well suited to support such future analyses. We therefore see this work as providing a robust experimental platform upon which more comprehensive quantitative modeling can be built, either in future studies or through collaboration with specialists in metabolic modeling.

      Reviewer #3 (Recommendations For The Authors):

      We sincerely thank the reviewer for the important suggestions.

      (1) I strongly disagree that "regulation of glycolytic flux".. "remained largely unexplored.”

      Our original wording was meant to emphasize not the absence of prior work on glycolytic flux regulation, but rather that the specific biochemical mechanism by which LDH regulates glycolytic flux—particularly through the integrated effects of enzyme kinetics, redox balance, and thermodynamic constraints within the pathway—has not been fully elucidated.

      To avoid any ambiguity or overstatement, we have revised the relevant text to more precisely reflect this intent. The revised wording now reads:

      “This study elucidates a biochemical mechanism by which lactate dehydrogenase influences glycolytic flux in cancer cells, revealing a kinetic–thermodynamic interplay that contributes to metabolic regulation.”

      We believe this revised phrasing more accurately acknowledges prior work while clearly defining the specific mechanistic contribution of the present study.

      (2) Very confusing in the Introduction section: "If LDH is inhibited at the LDH step..”

      We sincerely thank the reviewer for pointing out the potential confusion caused by the phrase “If LDH is inhibited at the LDH step” in the Introduction.

      Our intention was to contrast two conceptual models of LDH inhibition. The first is the conventional view, in which the effect of LDH inhibition is assumed to be confined to the LDH-catalyzed reaction itself, leading primarily to local accumulation of pyruvate and its redirection toward mitochondrial metabolism. The second, which is supported by our data, is that LDH inhibition initiates a system-wide biochemical response, perturbing redox balance, upstream enzyme kinetics, and the thermodynamic state of the glycolytic pathway, ultimately resulting in coordinated suppression of glycolysis, the TCA cycle, and OXPHOS.

      We agree that the original phrasing was ambiguous and potentially misleading. To improve clarity, we have revised the text as follows:

      “If the effect of LDH inhibition were confined solely to its catalytic step…”

      (3) The entire introduction part when the authors attempt to explain how decreased glycolysis will lead to decreased mitochondrial respiration is confusing.

      We would like to clarify that the Introduction does not attempt to explain how decreased glycolysis leads to decreased mitochondrial respiration. Rather, the final paragraph of the Introduction is intended to highlight an unresolved conceptual inconsistency in the existing literature and to motivate the central question addressed in this study.

      Specifically, we summarize the prevailing view that LDH inhibition redirects pyruvate toward mitochondrial metabolism and enhances oxidative phosphorylation, and then point out that this interpretation is difficult to reconcile with quantitative considerations, such as the large disparity between glycolytic and mitochondrial flux capacities and the excess catalytic activity of LDH relative to upstream glycolytic enzymes. These observations are presented to emphasize that the biochemical mechanism linking LDH inhibition to changes in glycolysis and mitochondrial respiration has not been fully resolved.

      Importantly, the Introduction does not propose a mechanistic explanation for the observed suppression of mitochondrial respiration; rather, it poses this as an open question, which is then systematically addressed through experimental analysis in the Results section.

      (4) Line 144: "which is 81(HeLa-LDHAKO) -297(HeLa-Ctrl) times"- here and in many other places wording is confusing to the reader.

      Our intention was to emphasize the significant redundancy of LDH activity relative to hexokinase (HK), the first rate-limiting enzyme in the glycolysis pathway, in cancer cells.

      Specifically, we wanted to convey that in HeLa-Ctrl cells the total LDH activity is 297 times the HK activity, whereas in HeLa-LDHAKO cells the total LDH activity, although decreased, is still 81 times the HK activity. These data come from Supplementary Table 1 of the paper and provide quantitative evidence for "why knocking out LDHA or LDHB alone is insufficient to significantly affect glycolysis flux": the remaining LDH activity is still far higher than the HK activity at the pathway entrance and is therefore sufficient to maintain flux.

      Following your suggestion, we have rewritten it in the revised draft with a more specific statement: "...the total activity of LDH in HeLa cells is very high, which is 297-fold higher than the first rate-limiting enzyme HK activity in HeLa-Ctrl cells and 81-fold higher in HeLa-LDHAKO cells.”
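
      A minimal sketch of the arithmetic implied by these two ratios. It assumes, purely for illustration, that HK activity is the same in HeLa-Ctrl and HeLa-LDHAKO cells; the response does not state this, and the function name is hypothetical.

```python
# LDH/HK activity ratios quoted in the response.
LDH_TO_HK_CTRL = 297.0     # HeLa-Ctrl
LDH_TO_HK_LDHA_KO = 81.0   # HeLa-LDHAKO


def residual_ldh_fraction(ratio_ko: float, ratio_ctrl: float) -> float:
    """Residual LDH activity in the knockout as a fraction of control.

    ASSUMPTION: HK activity is identical in both cell lines, so the
    ratio of the two LDH/HK ratios equals the ratio of LDH activities.
    """
    if ratio_ctrl <= 0:
        raise ValueError("control ratio must be positive")
    return ratio_ko / ratio_ctrl


fraction = residual_ldh_fraction(LDH_TO_HK_LDHA_KO, LDH_TO_HK_CTRL)
print(f"Residual LDH activity under the equal-HK assumption: {fraction:.1%}")
```

      Under that assumption, roughly a quarter of the control LDH activity remains after LDHA knockout, yet it still exceeds HK activity 81-fold, which is the point the revised sentence makes.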

      (5) Line 153: "in the following four aspects:"- but what are these aspects, the text below has no corresponding subtitles, etc.

      Our intention was to indicate that after LDHA or LDHB knockout alone failed to affect the glycolysis rate, we further explored its potential impact on the glycolytic pathway from four deeper perspectives: the glucose carbon to pyruvate and lactate, the glucose carbon to subsidiary branches of glycolysis, the concentration of glycolytic intermediates and the thermodynamic state of the pathway, and the redox state of cytosolic free NADH/NAD<sup>+</sup>.

      Following your valuable suggestion, we have now added the aforementioned clear subtitles to these four aspects in the revised manuscript.

      (6) Lines 193, another example of the very confusing statement: "The results suggested that the loss of total LDH concentration was compensated.."

      The actual catalytic activity (reaction rate) of LDH is determined by both its enzyme concentration and substrate concentration (pyruvate and NADH). When the total LDH protein concentration (enzyme amount) in the cell is reduced through gene knockout, the reaction equilibrium is disrupted. To maintain sufficient lactate production flux to support a high glycolysis rate, the cell compensates by increasing the concentration of one of the substrates—free NADH (as shown in Figure 1I). This results in an increased substrate concentration, despite a reduction in the amount of enzyme, thus partially maintaining the overall reaction rate.

      We have revised the original statement to more accurately describe this kinetic equilibrium process: "The decrease in total LDH concentration was counterbalanced by a concomitant increase in the concentration of its substrate, free NADH, thereby maintaining the reaction velocity.”
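
      The compensation argument can be sketched with a deliberately simplified rate law. The example below assumes single-substrate Michaelis-Menten kinetics in free NADH with pyruvate held constant; this is an illustrative simplification, not the kinetic model used in the study, and all parameter values are hypothetical and unitless.

```python
def mm_rate(enzyme: float, kcat: float, substrate: float, km: float) -> float:
    """Michaelis-Menten rate: v = kcat * [E] * [S] / (Km + [S])."""
    return kcat * enzyme * substrate / (km + substrate)


def substrate_for_rate(target_rate: float, enzyme: float, kcat: float, km: float) -> float:
    """Substrate level needed to reach target_rate at a given enzyme level."""
    vmax = kcat * enzyme
    if target_rate >= vmax:
        raise ValueError("target rate is not reachable at this enzyme level")
    # Solve v = vmax * S / (Km + S) for S.
    return target_rate * km / (vmax - target_rate)


# Hypothetical illustration values (not taken from the study).
KCAT, KM = 1.0, 1.0
v_control = mm_rate(enzyme=10.0, kcat=KCAT, substrate=0.1, km=KM)

# Reduce the enzyme amount and ask how much substrate restores the same rate.
nadh_needed = substrate_for_rate(v_control, enzyme=3.0, kcat=KCAT, km=KM)
print(f"control rate = {v_control:.3f}; substrate needed after enzyme loss = {nadh_needed:.3f}")
```

      In this toy setting, a lower enzyme amount can sustain the same reaction velocity only if the substrate concentration rises, which mirrors the qualitative statement about free NADH above.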

      (7) Line 222-223: "did not or marginally significantly affect....”

      Our intention is to reflect the complexity of the data in Figure 1. Specifically: Regarding "did not affect": This means that there were no statistically significant differences in most key parameters, such as glycolytic flux (glucose consumption rate, lactate production rate). Regarding "or marginally significantly affected": This means that in a few indicators, although statistical calculations showed p-values less than 0.05, the absolute value of the difference was very small, with limited biological significance.

      To clarify this, we have rewritten it as: "...did not significantly affect glucose-derived pyruvate entering into TCA cycle, neither significantly affect mitochondrial respiration, although statistically significant but minimal changes were observed in a few specific parameters (e.g., m3-pyruvate% in medium).”

      (8) It is very confusing to use the same colors for three GNE-140 drug concentrations (Figure 2a-b) and for 3 different cell lines right next to each other (Figure 2c-d).

      The figures have been revised accordingly.

      (9) Lines 263-273: nothing is new here as oxidized NAD+ is required for run glycolysis and LDH inhibition/KO leads to a high NADH/NAD+ ratio; Also below it is well known that reductive stress blocks serine biosynthesis;

      It is well established that oxidized NAD<sup>+</sup> is required for glycolysis, that LDH inhibition or knockout increases the NADH/NAD<sup>+</sup> ratio, and that reductive stress can suppress serine biosynthesis. We did not intend to present these observations as novel.

      The key point of this section is not the qualitative requirement of NAD<sup>+</sup> for GAPDH, but rather the mechanistic alignment between LDH inhibition, changes in free NAD<sup>+</sup> availability, and the emergence of GAPDH as a flux-controlling step within the glycolytic pathway under steady-state conditions. Previous studies have largely treated the increase in NADH/NAD<sup>+</sup> following LDH inhibition as a correlative or downstream effect, without directly demonstrating how this redox shift quantitatively propagates upstream to reorganize glycolytic flux distribution and thermodynamic driving forces.

      In our study, we explicitly link LDH inhibition to (i) an increase in free NADH/NAD<sup>+</sup> ratio, (ii) inhibition of GAPDH activity in intact cells, (iii) accumulation of upstream glycolytic intermediates, (iv) suppression of serine biosynthesis from 3-phosphoglycerate, and critically, (v) coordinated shifts in the Gibbs free energies of reactions between PFK1 and PGAM. This integrated kinetic–thermodynamic framework goes beyond the established qualitative understanding of NAD<sup>+</sup> dependence and provides a pathway-level mechanism by which LDH activity controls glycolytic flux.

      (10) Lines 368-370: "... we reached an alternative interpretation of the data.."- does not provide much confidence.

      Our intention was to emphasize, with due caution, that we propose a new interpretation based on detailed data, one that differs from conventional views. Our interpretation is grounded in key and consistent evidence from dual isotope tracing experiments using [<sup>13</sup>C<sub>6</sub>]glucose and [<sup>13</sup>C<sub>5</sub>]glutamine. [<sup>13</sup>C<sub>6</sub>]Glucose tracing data: the labeling pattern of citrate, the first product of the TCA cycle, showed a significant decrease in m+2 %. This directly reflects a reduction in the flux of newly generated acetyl-CoA from glucose entering the TCA cycle. Simultaneously, the summed % of the other isotopologues (m+1/m+3/m+4/m+5/m+6) increased, indicating a longer retention time of the labeled carbon in the cycle and implying a simultaneous decrease in the efflux of cycle intermediates for biosynthesis. [<sup>13</sup>C<sub>5</sub>]Glutamine tracing data: the labeling pattern of α-ketoglutarate showed a decrease in m+5 %, indicating a reduction in glutamine replenishment flux. The pattern of change in the summed % of the other isotopologues (m+1/m+2/m+3/m+4) also supports the conclusion of reduced intermediate efflux.

      These two sets of data corroborate each other, pointing to a unified conclusion: LDH inhibition not only reduces carbon source inflow into the TCA cycle but also decreases intermediate product efflux, leading to a decrease in overall cycle activity. Therefore, our "alternative interpretation" is a well-supported and more consistent explanation of our overall experimental results. We have revised the original wording to: "Integrated analysis of dual isotope tracing data demonstrates that LDH inhibition reduces both influx and efflux of the TCA cycle..."
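
      The two quantities compared in this response (m+2 % and the summed % of the other labeled isotopologues) can be sketched as follows. The intensities are hypothetical, the function names are illustrative, and natural-abundance correction is omitted for brevity, so this is not the authors' processing pipeline.

```python
def isotopologue_fractions(intensities):
    """Convert raw intensities for m+0..m+n into fractional abundances."""
    total = sum(intensities)
    if total <= 0:
        raise ValueError("total intensity must be positive")
    return [value / total for value in intensities]


def citrate_summary(intensities):
    """Return (m+2 fraction, summed fraction of the other labeled species).

    'Other labeled' means m+1, m+3, m+4, m+5 and m+6; unlabeled m+0 is excluded.
    """
    fractions = isotopologue_fractions(intensities)
    m2 = fractions[2]
    other_labeled = sum(fractions[1:]) - m2
    return m2, other_labeled


# Hypothetical citrate intensities for m+0..m+6 (not data from the study).
example = [50, 5, 30, 5, 5, 3, 2]
m2, other = citrate_summary(example)
print(f"m+2 = {m2:.0%}, other labeled isotopologues = {other:.0%}")
```

      A fall in the first number together with a rise in the second is the pattern the response interprets as reduced entry of glucose-derived acetyl-CoA and longer retention of labeled carbon in the cycle.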

      (11) Lines 418-421: This entire discussion on how TCA cycle activity is decreased upon LDH inhibition is very confusing. I also would like to see these tracer studies when ETC is inhibited with different inhibitors.

      We would like to clarify that the mitochondrial respiration rate data presented in Figure 5W are based on studies using different ETC inhibitors, and the cell treatment conditions (including culture time, etc.) for these oxygen consumption measurements are consistent with the conditions for the [<sup>13</sup>C<sub>6</sub>]glucose and [<sup>13</sup>C<sub>5</sub>]glutamine isotope tracing experiments (Figure 5A-V). Therefore, the changes in TCA cycle flux revealed by the tracing data and the inhibition of OXPHOS rate shown by the respiration measurements are mutually corroborating evidence from the same experimental conditions.

      (12) Figure 6F, G - very limited representation of growth curves, why not perform these experiments with all corresponding cell lines and over multiple days. Especially since proliferation arrest vs cell death was implicated.

      We have provided the growth curves of the HeLa-Ctrl and HeLa-LDHAKO cell lines under the corresponding treatments in Figure 6—figure supplement 1, as a supplement to Figure 6F, G (HeLa-LDHBKO cells). The choice of 48 hours as the cutoff observation point is based on clear biological evidence: under the stress of hypoxia (1% O<sub>2</sub>) combined with GNE-140 treatment, HeLa-LDHBKO cells experienced substantial death within 24 to 48 hours, at which point the differences in the growth curves were already very significant.

      (13) Move most of the Supplementary tables into an Excel file - so values can be easily accessed.

      We have compiled the tables into an Excel file and submitted it along with the revised manuscript as supplementary material.

      (14) Consider changing colors to more appealing- especially jarring is a bright blue, red, black combination on many bar graphs.

      We have adjusted the color scheme of the figures (especially the bar graphs) in the paper, and have submitted them with the revised manuscript.

      (15) Double check y-axis on multiple graphs it says "mM".

      We have checked the y-axes; the unit (mM) is correct.

      (16) Instead TCA cycle use the TCA cycle.

      In the revised manuscript, "the TCA cycle" is used.

    1. eLife Assessment

      This valuable study aims to determine mechanisms underlying breast cancer initiation and tumour progression. The manuscript includes a solid set of transcriptomic and proteomic datasets from tumour samples and examines mitochondrial function within the tumours. While the underlying mechanisms linking expression changes to functional effects remain speculative, this paper provides a resource for researchers working on breast cancer and/or HER2-driven bioenergetics changes.

    2. Reviewer #1 (Public review):

      Summary:

      In this manuscript, Frangos et al. used a transcriptomic and proteomic approach to characterise changes in HER2-driven mammary tumours compared to healthy mammary tissue in mice. They observed that mitochondrial genes, including OXPHOS regulators, were among the most down-regulated genes and proteins in their datasets. Surprisingly, these were associated with higher mitochondrial respiration, in response to a variety of carbon sources. In addition, there seems to be a reduction in mitochondrial fusion and an increase in fission in tumour tissues compared to healthy tissues.

      Strengths:

      The data are clearly presented and described.

      The author reported very similar trends in proteomic and transcriptomic data. Such approaches are essential to have a better understanding of the changes in cancer cell metabolism associated with tumorigenesis.

      The authors provided a direct link between HER2 inhibition and OXPHOS, strengthening the mechanistic aspect of the work.

      Weaknesses:

      The manuscript would have benefited from more ex-vivo approaches to further dissect mechanistic links and resolve the contradiction of elevated respiration with reduced expression of most associated proteins (but these points are clearly articulated in the discussion).

      The results presented support the authors' conclusions, and limitations are addressed in the discussion. This work will likely impact the progression of the field, and the provided data will benefit the scientific community.

      Comments on revisions:

      The authors addressed all my concerns.

    3. Reviewer #2 (Public review):

      Frangos et al present a set of studies aiming to determine mechanisms underlying initiation and tumour progression. Overall, this work provides some useful datasets, further establishing mitochondrial dysfunction during the cellular transformation process.

      A key strength is the coordinated analysis of transcriptomics and proteomics from tumour samples derived from a Neu-dependent mouse model for breast cancer. This analysis provides rigorous datasets that show robust patterns, including down-regulation across many components of mitochondrial OXPHOS that were generally consistent at both the mRNA and protein level. Parallel analysis of corresponding tumour samples thereby clearly shows the opposite trend of increased mitochondrial function, which is unexpected. As such, this work further establishes altered mitochondrial phenotypes in tumour contexts and further illustrates that mitochondrial function is not necessarily always tightly correlated with mitochondrial gene expression patterns.

      Several key weaknesses remain. It remains unclear how increased mitochondrial function is being sustained despite wide decreases in mRNA and protein levels of OXPHOS components. In terms of mechanism, the study confirmed that pharmacologic EGFR inhibition decreases OXPHOS in an EGFR-dependent breast cancer line. However, it remains unclear if the cell culture system recapitulates other key observations of the tumour model (namely decreased expression with increased function).

      Therefore, the mechanistic basis of increased mitochondrial function in light of decreased mitochondrial content remains speculative, as does the role of these changes for tumour initiation or progression.

      Comments on revisions:

      We agree with the overall findings of the study and appreciate that the claims in text and title have been appropriately toned down.

      As additional suggestions, e.g. for presentation, many of the graphics/labels are still too small to be useful. It would be interesting to see if this cell line is similar to the tumours in terms of all the phenotypes. The lapatinib experiment was good. I wonder how quickly this drug affects the mitochondria. Also, it would be interesting to see if these cells have higher OXPHOS than other non-transformed breast epithelial cells.

      The WB on OXPHOS components with ab110413 is good, but it looks like many subunits are detected, so this should be made clear.

    4. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      In this manuscript, Frangos et al. used a transcriptomic and proteomic approach to characterise changes in HER2-driven mammary tumours compared to healthy mammary tissue in mice. They observed that mitochondrial genes, including OXPHOS regulators, were among the most down-regulated genes and proteins in their datasets. Surprisingly, these were associated with higher mitochondrial respiration, in response to a variety of carbon sources. In addition, there seems to be a reduction in mitochondrial fusion and an increase in fission in tumours compared to healthy tissues.

      Strengths:

      The data are clearly presented and described.

      The author reported very similar trends in proteomic and transcriptomic data. Such approaches are essential to have a better understanding of the changes in cancer cell metabolism associated with tumourigenesis.

      Weaknesses:

      (1) This study, despite being a useful resource (assuming all the data will be publicly available and not only upon request) is mainly descriptive and correlative and lacks mechanistic links.

      We appreciate this point. While the primary goal of our study was to assess mitochondrial adaptations with HER2-driven tumorigenesis, we agree that strengthening the mechanistic interpretation would improve the impact of the data. To address this, we have provided experiments demonstrating that HER2 inhibition in NF639 cells with lapatinib suppresses respiratory capacity, directly supporting the interpretation that HER2 activity regulates respiratory function (Figure 10). We have expanded the discussion appropriately (lines 378-394). Both raw RNA-seq and proteomic data were deposited through the GEO and PRIDE repositories (accession numbers included in the Data Availability Statement).

      (2) It would be important to determine the cellular composition of the tumour and healthy tissue used. Do the changes described here apply to cancer cells only or do other cell types contribute to this?

      We thank the reviewer for this suggestion; we have added experiments that have directly addressed this concern.

      Cell type composition analysis by immunofluorescence was added (Figure 6) where we quantified epithelial, mesenchymal, endothelial, immune and stromal populations in our benign mammary tissue and tumor samples. We found no major shift in the dominant cell types that would confound transcriptomic data in whole tissues.

      We integrated immunofluorescence data with a publicly available scRNA-seq dataset from human breast tumors, which allowed us to estimate cell-type-specific expression of OXPHOS genes in our own samples. Despite the possibility of species differences, this is the only dataset of its kind, and we used it to generate an estimate of cell-type-weighted OXPHOS mRNA expression (Figure 6). This revealed that epithelial cells are likely the dominant contributors to OXPHOS gene expression for CI–IV. All calculations are delineated in the Methods section.
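
      One way such a cell-type-weighted estimate could be computed is sketched below. The exact calculation is described in the authors' Methods and is not reproduced here; this sketch simply assumes that each cell type's contribution is its tissue fraction multiplied by its per-cell-type expression. The cell-type names, numbers, and function name are hypothetical.

```python
def weighted_contributions(fractions, expression):
    """Per-cell-type contribution to bulk expression.

    fractions:  cell type -> share of cells (e.g. from immunofluorescence)
    expression: cell type -> mean expression of the gene set (e.g. from scRNA-seq)
    ASSUMPTION: contribution = normalized fraction * expression.
    """
    missing = set(fractions) - set(expression)
    if missing:
        raise KeyError(f"no expression value for: {sorted(missing)}")
    total = sum(fractions.values())
    if total <= 0:
        raise ValueError("cell-type fractions must sum to a positive value")
    return {ct: fractions[ct] / total * expression[ct] for ct in fractions}


# Hypothetical inputs (not data from the study).
fractions = {"epithelial": 0.6, "immune": 0.2, "stromal": 0.2}
expression = {"epithelial": 2.0, "immune": 1.0, "stromal": 0.5}

contrib = weighted_contributions(fractions, expression)
bulk = sum(contrib.values())
for cell_type, value in contrib.items():
    print(f"{cell_type}: {value / bulk:.0%} of the weighted signal")
```

      With inputs like these, the cell type that is both abundant and high-expressing dominates the weighted signal, which is the kind of conclusion drawn for epithelial cells above.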

      (3) Are the changes in metabolic gene expression a consequence of HER2 signalling activation? Ex-vivo experiments could be performed to perturb this pathway and determine cause-effects.

      Thank you for this suggestion – we have included an experiment directly testing this concept. We assessed mitochondrial respiration in NF639 HER2-driven mammary tumor epithelial cells in the presence or absence of the well-described dual tyrosine kinase inhibitor lapatinib. Lapatinib reduced basal, CI-linked and CI+II-linked respiration without compromising mitochondrial integrity or coupling, demonstrating that HER2 activation regulates respiration in our model. These data are presented in Figure 10, and a new section has been added to the discussion describing the implications of this finding in the context of the current literature (lines 378-394).

      (4) The data of fission/fusion seem quite preliminary and the gene/protein expression changes are not so clear cut to be a convincing explanation that this is the main reason for the increased mitochondria respiration in tumours.

      We agree mitochondrial morphology and dynamics alone cannot fully account for the observed respiratory phenotype – this was emphasized in the discussion but has since been further clarified (lines 365-377). We retained the TEM and dynamics gene/protein data because they do support morphological differences consistent with enhanced fission. However, we have revised the tone of our interpretation to more explicitly acknowledge that these findings are correlative, and the updated discussion now emphasizes that the increased respiratory capacity in tumors is likely driven by multiple converging mechanisms.

      Reviewer #2 (Public review):

      Frangos et al present a set of studies aiming to determine mechanisms underlying initiation and tumour progression. Overall, this work provides some useful insights into the involvement of mitochondrial dysfunction during the cellular transformation process. This body of work could be improved in several possible directions to establish more mechanistic connections.

      (5) The interesting point of the paper: the contrast between suppressed ETC components and activated OXPHOS function is perplexing and should be resolved. It is still unclear if activated mitochondrial function triggers gene down-regulation vs compensatory functional changes (as the title suggests). Have the authors considered reversing the HER2-derived signals e.g. with PI3K-AKT-MTOR or ERK inhibitors to potentially separate the expression vs. functional phenotypes? The root of the OXPHOS component down-regulation should also be traced further, e.g. by probing into levels of core mitochondrial biogenesis factors. Are transcript levels of factors encoded by mtDNA also decreased?

      We appreciate this insight and agree that the discordance between mitochondrial content and function is fascinating. We have addressed the concerns above in the following manner:

      - We have altered the title – we agree we cannot definitively say that the enhanced respiratory capacity observed is compensatory.

      - We have added experiments in NF639 cells in the presence of lapatinib, a tyrosine kinase inhibitor to interrogate whether HER2 is necessary for our functional outcome of interest – the enhanced respiratory capacity in the tumors. Lapatinib significantly suppressed respiration (Figure 10) demonstrating HER2 signaling directly regulates mitochondrial respiration.

      - We have expanded the discussion to provide further comment on potential explanations for increased respiratory function and low mitochondrial content.

      (6) The second interesting aspect of this study is the implication of mitochondrial activation in tumours, despite the downregulation of expression signatures, suggestive of a positive role for mitochondria in this tumour model. To address if this is correlative or causal, have the authors considered testing an OXPHOS inhibitor for suppression of tumorigenesis?

      Previous studies have eloquently highlighted that directly or indirectly inhibiting mitochondria can suppress growth in HER2-driven breast cancer (PMID: 31690671) or, alternatively, that amplification of mt-HER2 enhances tumorigenesis (PMID: 38291340). In many solid tumors, this is the concept behind preclinical and clinical studies using IACS-010759 or similar inhibitors of OXPHOS, which do suppress growth but have significant off-target effects in healthy tissues (PMID: 36658425, 3580228). We have expanded the discussion to ensure the reader is aware of these previous contributions and highlighted the importance of future work delineating the role of enhanced respiratory function in HER2-driven mammary cancer (lines 378-394).

      (7) A number of issues concerning animal/tumour variability and further pathway dissection could be explored with in vitro approaches. Have the authors considered deriving tumour-derived cell cultures, which could enable further confirmations, mechanistic drug studies and additional imaging approaches? Culture systems would allow alternative assessment of mitochondrial function such as Seahorse or flow cytometry (mitochondrial potential and ROS levels).

      We thank the reviewer for this suggestion – we have addressed this in part by using the NF639 HER2-driven tumor epithelial line, which demonstrated that HER2 regulates our observed respiratory response. Unfortunately, the addition of tumor-derived cell cultures was not feasible or within the scope of our study. Animal and tumor variability has been clarified in the Methods section (lines 424-429). Mitochondrial respiration experiments were performed in paired tissue (benign and tumor from the same mouse). Transcriptomic, proteomic and histological analyses were performed on tumors and benign samples from different mice due to tissue limitations.

      (8) The study could be greatly improved with further confirmatory studies, eg immunoblotting for mitochondrial components with parallel blots for phospho-signalling in the same samples. It would be interesting if trends could be maintained in tumour-derived cell cultures. It is notable that OXPHOS protein/transcript changes are more consistent (Figure 5, Supplementary Figure 4) than mitochondrial dynamics /mitophagy factors (Figure 8). Core regulatory factors in these pathways should be confirmed by conventional immunoblotting.

      We thank the reviewer for this thoughtful comment. While we agree that additional confirmatory studies can be valuable, due to tissue quantity constraints and the number of assays required for our multi-omics analysis, extensive additional blots were not feasible. However, we had sufficient protein to provide select OXPHOS proteins to verify the proteomic data (now provided in S-Fig.4H). Furthermore, we have plotted the fold change of genes and proteins detected in both datasets and added this to Figure 4 (4A, B), further highlighting the consistency between our transcriptomic and proteomic findings. We believe that the highly consistent and concordant nature of our datasets collectively provides strong support for our central objective - determining whether mitochondrial content and respiratory function correlate in HER2-driven mammary tumors. The reproducibility of OXPHOS-related changes reinforces the robustness of our observations. We also appreciate the reviewer’s insight that OXPHOS alterations appear particularly consistent. In response, we have edited the discussion to further emphasize this point, especially in relation to the distinctive pattern observed for Complex V, which showed greater preservation relative to Complexes I–IV across several methods (lines 348-364). We comment on how this stoichiometric shift may contribute to intrinsic respiratory activation despite reduced mitochondrial content.

      Recommendations for the authors:

      Reviewer #2 (Recommendations for the authors):

      Further Minor points.

      (9) It would be helpful to know further details regarding the source of the tumour samples, particularly for the proteomics (N=5) and transcriptomics (N=6) datasets, since the exact timepoint of tissue harvest and number of tumours/mouse varied, according to the methods section. Were all samples from the omics studies from different mice (i.e., 11 mice)? B4 and B6 seem like outliers in mitochondrial transcriptomes. Are these directly paired, e.g., with T4 and T6? Are the side-by-side pairs of Ben and Tum samples for blots in Figure 1 and Supplementary Figure 1 from the same mouse?

      This has been clarified in the Methods section (lines 424-429). Mitochondrial respiration experiments were performed in paired tissue (benign and tumor from same mouse). Transcriptomic, proteomic and histological analyses were performed on tumors and benign samples from different mice due to tissue limitations.

      (10) Further references and details are needed to support the methodology of the mitochondrial function tests (e.g., nutrients vs. pairing with complexes). What was the time point of nutrient supplementation? It would seem that the lipid substrates should take longer to activate OXPHOS than pyruvate/malate or succinate. Is this the case? Is there speculation as to why succinate supplementation is much more active than pyruvate+malate? What is +MD in Figure 6? The rationale for pooling data for Figure 7A is unclear since the categories appear to overlap: (pyruvate, malate, ADP) vs. (palmitoyl-carnitine, malate, ADP).

      Thank you for this comment. We have expanded the methods (lines 515-531) to provide additional detail on the mitochondrial respiration protocol. Briefly, permeabilized tissues were exposed to substrates delivered at supraphysiological concentrations in a sequential protocol lasting ~30–60 minutes. Under these conditions, mitochondrial respiration reflects the maximal capacity to utilize each substrate rather than the physiological time course of substrate mobilization or uptake that would occur in vivo with the influence of blood flow and transport/substrate availability limitations.

      (11) Many of the figures were blurry (Figure 1F, 2B) or had labels that were too small to be effective (Figures 1G, H, 2D-G, 3E-G, 5E-I, 7C, 8B).

      The font size of figure labels has been increased where possible and all figures have been exported to maximize resolution.

    1. eLife Assessment

      This study presents an important methodological advance, Liver-CUBIC combined with multicolor metallic nanoparticle perfusion, that enables high-resolution 3D visualization of the liver's complex multi-ductal architecture. The identification of the Periportal Lamellar Complex (PLC) as a novel perivascular structure with distinct cellular composition and low-permeability characteristics is convincing, supported by rigorous imaging data. The observed scaffolding role during fibrosis offers intriguing biological insights, though the functional claims would benefit from direct experimental validation.

    2. Reviewer #1 (Public review):

      Summary:

      In this manuscript, Chengjian Zhao et al. focused on the interactions between vascular, biliary, and neural networks in the liver microenvironment, addressing the critical bottleneck that the lack of high-resolution 3D visualization has hindered understanding of these interactions in liver disease.

      Strengths:

      This study developed a high-resolution multiplex 3D imaging method that integrates multicolor metallic compound nanoparticle (MCNP) perfusion with optimized CUBIC tissue clearing. This method enables the simultaneous 3D visualization of spatial networks of the portal vein, hepatic artery, bile ducts, and central vein in the mouse liver. The authors reported a perivascular structure termed the Periportal Lamellar Complex (PLC), which is identified along the portal vein axis. This study clarifies that the PLC comprises CD34<sup>+</sup>Sca-1<sup>+</sup> dual-positive endothelial cells with a distinct gene expression profile, and reveals its colocalization with terminal bile duct branches and sympathetic nerve fibers under physiological conditions.

      Comments on revisions:

      The authors very nicely addressed all concerns from this reviewer. There are no further concerns or comments.

    3. Reviewer #3 (Public review):

      Xu, Cao and colleagues aimed to overcome the obstacles of high-resolution imaging of intact liver tissue. They report successful modification of the existing CUBIC protocol into Liver-CUBIC, a high-resolution multiplex 3D imaging method that integrates multicolor metallic compound nanoparticle (MCNP) perfusion with optimized liver tissue clearing, significantly reducing clearing time and enabling simultaneous 3D visualization of the portal vein, hepatic artery, bile ducts, and central vein spatial networks in the mouse liver. Using this novel platform, the researchers describe a previously unrecognized perivascular structure they termed Periportal Lamellar Complex (PLC), regularly distributed along the adult liver portal veins.

      Using available scRNAseq data, the authors assessed the CD34<sup>+</sup>Sca-1<sup>+</sup> cells' expression profile, highlighting mRNA presence of genes linked to neurodevelopment, bile acid transport, and hematopoietic niche potential. Different aspects of this analysis were then addressed by protein staining of selected marker proteins in the mouse liver tissue. Next, the authors addressed how the PLC and biliary system react to CCl4-induced liver fibrosis, implying that the PLC dynamically extends, acting as a scaffold that guides the migration and expansion of terminal bile ducts and sympathetic nerve fibers into the hepatic parenchyma upon injury.

      The work clearly demonstrates the usefulness of the Liver-CUBIC technique and the improvement of both resolution and complexity of the information, gained by simultaneous visualization of multiple vascular and biliary systems of the liver. The identification of PLC and the interpretation of its function represent an intriguing set of observations that will surely attract the attention of liver biologists as well as hepatologists. The importance of the CD34+/Sca1+ endothelial cell population and claims based on transcriptomic re-analysis require future assessment by functional experimental approaches to decipher the functional molecules involved in PLC formation, maintenance, and the involvement in injury response before establishing their role in biliary, arterial, and neural liver systems.

      Strengths:

      The authors clearly demonstrate an improved technique tailored to the visualization of the liver vasculo-biliary architecture in unprecedented resolution.

      This work proposes a new morphological feature of adult liver facilitating interaction between the portal vein, hepatic arteries, biliary tree, and intrahepatic innervation, centered at previously underappreciated protrusions of the portal veins - PLCs.

      Weaknesses:

      The importance of CD34+Sca1+ endothelial cell sub-population for PLC formation and function was not tested and warrants further validation.

      Comments on revisions:

      I appreciate the authors' effort to revise the text so it more rigorously adheres to the presented evidence. Following a thorough read of the revised text, a few remaining minor issues were identified in the Discussion.

      (1) Where does the hard evidence for the PLC being a stem cell niche come from in the two following statements?

      This suggests that the PLC may not only provide structural support but also serve as a perivascular stem cell niche specific to the portal region, potentially involved in hematopoiesis and tissue regeneration.

      The PLC serves as a directional scaffold for ductal growth, a specialized stem cell niche, and a potential site of neurovascular coupling.

      (2) The following paragraph lacks references to previously published evidence of liver innervation guidance mechanisms, such as the mesenchyme-mediated guidance (CD31- population) reported by Gannoun et al., 2023 (https://doi.org/10.1242/dev.201642), which is important context for your finding.

      Further analysis showed significant upregulation of genes involved in neurodevelopment and axonal guidance in the CD34<sup>+</sup>Sca-1<sup>+</sup> cluster, along with activation of neuronal signaling pathways. Immunostaining confirmed the presence of TH<sup>+</sup> sympathetic nerve fibers wrapping around the PLC in a "beads-on-a-string" pattern (Fig. 6), consistent with a classic neurovascular unit(Adori et al., 2021). Previous studies have shown that sympathetic nerves enter the liver along collagen fibers of Glisson's capsule and interact with hepatic arteries, portal veins, and bile duct epithelium, supporting the PLC as a scaffold for intrahepatic neurovascular integration.

      (3) Several sentences have issues with a lack of space between words.

    4. Author response:

      The following is the authors’ response to the previous reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      In this manuscript, Chengjian Zhao et al. focused on the interactions between vascular, biliary, and neural networks in the liver microenvironment, addressing the critical bottleneck that the lack of high-resolution 3D visualization has hindered understanding of these interactions in liver disease.

      Strengths:

      This study developed a high-resolution multiplex 3D imaging method that integrates multicolor metallic compound nanoparticle (MCNP) perfusion with optimized CUBIC tissue clearing. This method enables the simultaneous 3D visualization of spatial networks of the portal vein, hepatic artery, bile ducts, and central vein in the mouse liver. The authors reported a perivascular structure termed the Periportal Lamellar Complex (PLC), which is identified along the portal vein axis. This study clarifies that the PLC comprises CD34<sup>+</sup>Sca-1<sup>+</sup> dual-positive endothelial cells with a distinct gene expression profile, and reveals its colocalization with terminal bile duct branches and sympathetic nerve fibers under physiological conditions.

      Comments on revisions:

      The authors very nicely addressed all concerns from this reviewer. There are no further concerns or comments.

      We sincerely thank the reviewer for the positive evaluation of the revised manuscript.

      Reviewer #2 (Public review):

      Summary:

      The present manuscript of Xu et al. reports a novel clearing and imaging method focusing on the liver. The Authors simultaneously visualized the portal vein, hepatic artery, central vein, and bile duct systems by injecting metal compound nanoparticles (MCNPs) with different colors into the portal vein, left ventricle of the heart, inferior vena cava, and the extrahepatic bile duct, respectively. The method involves: trans-cardiac perfusion with 4% PFA, the injection of MCNPs with different colors, clearing with the modified CUBIC method, cutting 200-micrometer-thick slices by vibratome, and then microscopic imaging. The Authors also perform various immunostainings (DAB or TSA signal amplification methods) on the tissue slices from MCNP-perfused tissue blocks. With the application of this methodical approach, the Authors report dense and very fine vascular branches along the portal vein. The authors name them the 'periportal lamellar complex (PLC)' and report that PLC fine branches are directly connected to the sinusoids. The authors also claim that these structures co-localize with terminal bile duct branches and sympathetic nerve fibers and contain endothelial cells with a distinct gene expression profile. Finally, the authors claim that PLCs proliferate in liver fibrosis (CCl4 model) and act as a scaffold for proliferating bile ducts in ductular reaction and for ectopic parenchymal sympathetic nerve sprouting.

      Strengths:

      The simultaneous visualization of different hepatic vascular compartments and their combination with immunostaining is a potentially interesting novel methodological approach.

      Weaknesses:

      This reviewer has some concerns about the validity of the microscopic/morphological findings as well as the transcriptomics results, and suggests that the conclusions of the paper should be viewed critically. Namely, at this point, it is still not fully clear whether the 'periportal lamellar complex (PLC)' that the Authors describe really exists as a distinct anatomical or functional unit, or whether these are fine portal branches that connect the larger portal veins to the adjacent sinusoids. Also, in my opinion, the only way to identify the molecular characteristics of such small and spatially highly organized structures as those fine radial portal branches is to perform high-resolution spatial transcriptomics (instead of data mining in an existing liver single-cell database and performing Venn diagram intersection analysis in hepatic endothelial subpopulations). Yet, the existence of such structures with a distinct molecular profile cannot be excluded. Further research with advanced imaging and omics techniques (such as high-resolution volume imaging and spatial transcriptomics/proteomics) is needed to reproduce these initial findings.

      We thank the reviewer for the thoughtful and constructive comments. In response to the reviewer’s concerns regarding the anatomical and molecular definition of the periportal lamellar complex (PLC), we have further clarified the scope and methodological boundaries of the present study in the revised manuscript.

      Regarding the key question raised by the reviewer—namely, whether the PLC represents an independent anatomical or functional unit, or merely small portal venous branches connecting larger portal veins to adjacent sinusoids—we provide below a more detailed explanation of the criteria used to define the PLC in this study. The identification of the PLC is primarily based on periportal structures that can be reproducibly recognized by three-dimensional imaging across multiple mice, exhibiting a relatively consistent spatial distribution within the periportal region. The PLC could be stably observed across different MCNP dye color assignments and independent experimental batches. In addition, three-dimensional CD31 immunofluorescence consistently revealed vascular-associated signal distributions in the same periportal region, indirectly supporting its spatial association with the periportal vascular system.

      At the morphological level, the PLC appears as a periportal vasculature-associated structure distributed around the main portal vein trunk and maintains a relatively consistent spatial proximity to portal veins, bile ducts, and neural components in three-dimensional space. This highly conserved spatial organization across multiple tissue systems supports the anatomical positioning of the PLC as a relatively distinct structural tissue unit within the periportal region.

      The present study primarily focuses on a descriptive characterization of the three-dimensional anatomical organization and spatial relationships of the PLC based on volumetric imaging and vascular labeling strategies. As a complementary exploratory analysis, we reanalyzed endothelial cell populations potentially associated with the PLC using existing liver single-cell transcriptomic datasets. This analysis was intended to provide molecular-level information consistent with the structural observations and to offer preliminary clues to its potential biological functions, rather than to independently define the PLC at the spatial level or to functionally validate it.

      We fully acknowledge the value of spatial transcriptomic and spatial proteomic technologies in revealing molecular heterogeneity within tissue architecture. However, under current technical conditions, these approaches are largely dependent on thin tissue sections and are limited by spatial resolution and signal mixing effects, which still pose challenges for resolving periportal structures with pronounced three-dimensional continuity, such as the PLC. In the future, further integration of high-resolution volumetric imaging with spatial omics technologies may enable a more refined understanding of the molecular features and potential functions of the PLC at higher spatial resolution.

      Reviewer #3 (Public review):

      Summary:

      In the revised version of the manuscript, the authors addressed multiple comments, clarifying especially the methodological part of their work and PLC identification as a novel morphological feature of the adult liver portal veins. The text is now also much clearer and has better flow.

      The additional assessment of the smartSeq2 data from Pietilä et al., 2025 strengthens the transcriptomic profiling of the CD34+Sca1+ cells and the discussion of the possible implications for liver homeostasis and injury response. While it may suffer from a similar bias to other scRNA-seq datasets (multiple cell fate signatures arising from mRNA contamination from proximal cells during dissociation), it is unlikely that such contamination would yield such similar results.

      Nevertheless, a more thorough assessment by functional experimental approaches is needed to decipher the functional molecules and definite protein markers before establishing the PLC as the key hub governing the activity of biliary, arterial, and neuronal liver systems.

      The work does bring a clear new insight into the liver structure and functional units and greatly improves the methodological toolbox to study it even further, and thus fully deserves the attention of the eLife readers.

      Strengths:

      The authors clearly demonstrate an improved technique tailored to the visualization of the liver vasculo-biliary architecture in unprecedented resolution.

      This work proposes a new morphological feature of adult liver facilitating interaction between the portal vein, hepatic arteries, biliary tree, and intrahepatic innervation, centered at previously underappreciated protrusions of the portal veins - the Periportal Lamellar Complexes (PLCs).

      Weaknesses:

      The importance of CD34+Sca1+ endothelial cell subpopulation for PLC formation and function was not tested and warrants further validation.

      We thank the reviewer for the careful and constructive comments regarding the functional validation of cell populations associated with the PLC. The central aim of this study is to establish and validate a novel volumetric imaging and vascular labeling strategy and to apply it to the periportal region of the liver, thereby revealing previously underappreciated structural organizational patterns at the three-dimensional level, rather than to perform a systematic functional validation of specific cellular subpopulations.

      We agree that the precise roles of the CD34<sup>+</sup>Sca-1<sup>+</sup> endothelial cell subpopulation in the formation and function of the periportal lamellar complex (PLC) have not been directly addressed through functional intervention experiments in the present study. Our conclusions are primarily based on three-dimensional imaging and spatial distribution analyses, which reveal a stable and consistent spatial association between this cell population and the PLC structure, but are not intended to independently support causal or functional inferences. The underlying functional mechanisms remain to be elucidated in future studies using genetic or functional perturbation approaches.

      In light of these considerations, we have further refined the relevant statements in the revised manuscript to more clearly define the functional scope and limitations of the current study in the Discussion section, and to avoid functional interpretations that extend beyond the direct support of the data. At the same time, we consider functional validation of the PLC to be an important and promising direction for future investigation.

      It should be emphasized that the present study is not primarily designed to provide direct functional validation, but rather to systematically characterize the three-dimensional structural features of the periportal lamellar complex (PLC) and its cellular associations using volumetric imaging and vascular labeling approaches. At this stage, we mainly provide spatial and histological evidence for the organizational relationship between the PLC structure and the CD34<sup>+</sup>Sca-1<sup>+</sup> endothelial cell population, while their specific roles in PLC formation and functional regulation await further investigation.

      Recommendations for the authors:

      Reviewer #2 (Recommendations for the authors):

      I highly appreciate the Authors' endeavors to improve the manuscript. I am listing those points (from my original review) where I still have further comments.

      (2) I would suggest this sentence:

      "...the liver has evolved a highly complex and densely organized ductal vascular-neuronal network in the body, consisting primarily of the portal vein system, central vein system, hepatic artery system, biliary system, and intrahepatic autonomic nerve network [6, 7]."

      We thank the reviewer for the valuable suggestion. We have revised the relevant sentence accordingly, and the revised wording is as follows:

      “The liver has evolved a highly complex and densely organized vascular–biliary–neural network, primarily composed of the portal venous system, central venous system, hepatic arterial system, biliary system, and the intrahepatic autonomic neural network.”

      (3) I suggest renaming 'clearing efficiency' to 'clearing time', and revise the last sentence like:

      '...The results showed that the average transmittance increased by 20.12% in 1mm-thick cleared tissue slices.'

      We thank the reviewer for this helpful suggestion. Accordingly, we have replaced the term “clearing efficiency” with “clearing time” and revised the final sentence to reflect this change. The revised wording is as follows:

      “The results showed that the average transmittance increased by 20.12% in cleared tissue slices with a thickness of 1 mm.”

      (4) While the dye perfusion was indeed performed on the full lobe, Fig. S1F also seems to be a thick section rather than a full 3D reconstruction. This is OK, but please be clear and specific about this in the respective part of the manuscript.

      We thank the reviewer for the careful review and detailed comments. We would like to clarify that Fig. S1F shows whole-lobe imaging of the mouse left liver lobe obtained after dye perfusion at the whole-liver scale, rather than an image derived from a thick tissue section. Although this image does not represent a three-dimensional reconstruction, it does reflect imaging of the entire left liver lobe at the macroscopic level.

      In addition, for the reviewer’s reference, we have provided in this response a representative image of a 200 μm-thick liver tissue section to directly illustrate the morphological differences between thick-section imaging and whole-lobe imaging. We note that the third and fourth panels in Fig. 1G of the main text already show local imaging results from 200 μm-thick sections; in contrast, the comparative image provided here presents a larger field of view and overall morphology. To avoid redundancy, this additional image is included solely for clarification in the present response and has not been incorporated into the revised manuscript or the supplementary materials.

      (11) Regarding the 'transmission quantification':

      'Regarding the comparative quantification of different clearing methods, as the reviewer noted, nearly all aqueous or organic solvent based clearing techniques can achieve relatively uniform transparency in 1 mm thick tissue sections, so differences at this thickness are limited.'

      So, based on all this, I think that measuring and comparing clearing efficacy in the present form is rather pointless; one may consider omitting this part.

      We thank the reviewer for the valuable comments. The purpose of the transmittance quantification in this study was not to provide a comprehensive comparison among different tissue-clearing methods, but rather to serve as a quantitative reference supporting the optimization of the Liver-CUBIC protocol. Accordingly, we have narrowed and clarified the relevant statements in the revised manuscript to define their scope and avoid overinterpretation.

      The revised text now reads as follows:

      “Importantly, Liver-CUBIC treatment did not induce significant tissue expansion (Figure 1B–D). In addition, quantitative transmittance measurements in 1-mm-thick cleared tissue slices showed an average increase of 20.12% (P < 0.0001; 95% CI: 19.14–21.09; Figure 1E).”

      Author response image 1.

      (16) It is OK, but please, indicate this clearly in the Methods/Results because in its present form it may be confusing for the reader: which color means what.

      We thank the reviewer for this helpful request for clarification. We agree that the previous wording may have caused confusion regarding the meaning of different MCNP colors. Accordingly, we have revised the Methods section and the relevant figure legends to clearly state that the color assignment of MCNP dyes is not fixed across different experiments or figures. The use of different colors serves solely for visualization and presentation purposes, facilitating the distinction of anatomical structures in multichannel and three-dimensional imaging, and does not indicate any fixed or intrinsic correspondence between a specific color and a particular vascular or ductal system. We believe that this clarification will help prevent misinterpretation and improve the overall clarity of the manuscript.

      (17) I still think the hepatic artery is extremely shrunk, while the portal vein is extremely dilated. Please note that in the figure referred to (from Adori et al.), the hepatic artery and portal vein are ca. 50 micrometers and 250 micrometers in diameter, respectively. In your figure, as far as I can see, they are ca. 9-10 micrometers and 125 micrometers, respectively. This means a 5x (Adori) vs. a 13-14x (yours) difference. I would not say that this is necessarily problematic, but it may reflect some perfusion issues that may be good to consider.

      We thank the reviewer for the careful comparison and acknowledge the quantitative differences pointed out. Compared with the study by Adori et al., the diameter ratio between the hepatic artery and the portal vein in our images does indeed differ to some extent. We believe that this discrepancy primarily arises from methodological differences in imaging and analysis strategies between the two studies.

      In the work by Adori et al., periportal vasculature identification and three-dimensional segmentation were mainly based on 488 nm autofluorescence signals acquired from inverted tissues. This signal predominantly reflects the overall outline of periportal tissue regions rather than direct imaging of the vascular lumen itself. Consequently, the measured “vessel diameter” largely represents a spatial domain delineated by surrounding periportal structures, and does not necessarily correspond to the actual or functional luminal diameter of the vessel.

      In contrast, the present study employed fluorescent MCNP dye perfusion under low perfusion pressure, combined with tissue clearing and three-dimensional optical imaging. Under these experimental conditions, the measured vessel diameters more closely reflect the perfusable luminal space of vessels in a fixed state, rather than their maximally dilated diameter, and are not defined by the morphology of surrounding tissues. This distinction is particularly relevant for the hepatic artery: as a high-resistance, smooth muscle–rich vessel, its diameter is highly sensitive to perfusion pressure and post-excision changes in vascular tone. In comparison, the portal vein exhibits greater compliance and is relatively less affected by these factors.

      Based on these methodological differences, the observation of relatively smaller apparent hepatic arterial diameters, and consequently a higher portal vein-to-hepatic artery diameter ratio, under dye perfusion–based optical imaging conditions is an expected outcome. Importantly, the primary focus of the present study is the identification and characterization of the periportal lamellar complex (PLC) as a three-dimensional lamellar tissue structure that can be stably and reproducibly recognized across different samples and imaging conditions, rather than absolute comparisons of vascular diameters.

      (21) After the presented documentation, I still have some concerns about whether the 'periportal lamellar complex (PLC)' that the Authors describe is really a distinct anatomical or functional unit. The confocal panel in Fig. 4F is nice and of high quality. However, as far as I can see, it shows that CD34+/Sca-1+ immunostaining is not specific for the presumptive PLCs in the periportal region. Instead, Sca-1 immunoreactivity is also highly abundant in the midzone, to which the supposed PLCs do not extend, according to the cartoon shown in panel D of the same figure. Notably, this also calls into question the specificity of the single-cell analysis.

      We thank the reviewer for this detailed and important comment regarding the specificity of CD34<sup>+</sup>/Sca-1<sup>+</sup> markers and the definition of the periportal lamellar complex (PLC).

      It should be emphasized that the PLC is not defined on the basis of any single molecular marker, but rather by a reproducible periportal lamellar anatomical structure consistently revealed by three-dimensional imaging across multiple samples. The co-expression of CD34 and Sca-1 is interpreted within this clearly defined anatomical context and is used to characterize the molecular features of endothelial cells associated with the PLC structure.

      As shown in Fig. 4F, the co-expression of CD34 and Sca-1 delineates a continuous, lamellar endothelial structure surrounding the portal vein. In contrast, outside the periportal region—including the midlobular areas—Sca-1 or CD34 expression can also be detected, but these signals appear scattered and discontinuous, lacking an organized lamellar topology.

      In the single-cell transcriptomic analysis, we treated CD34<sup>+</sup>/Sca-1<sup>+</sup> endothelial cells as an operational population to explore molecular features that may be enriched in the microenvironment of the periportal lamellar complex (PLC). Importantly, this analysis was intended to provide molecular clues associated with the PLC, rather than to precisely assign spatial locations or identities to individual cells.

      Occasional isolated Sca-1<sup>+</sup> signals detected outside the periportal region do not affect the anatomical definition of the PLC, nor do they alter the interpretation of the single-cell analysis. These analyses serve to provide supportive and exploratory molecular information for the structural identification of the PLC, rather than constituting decisive spatial evidence.

      (23) '....In the manuscript, we have carefully stated that this analysis is exploratory in nature and have avoided overinterpretation. In future studies, high-resolution spatial omics approaches will be invaluable for more precisely delineating the molecular characteristics of these fine structures.'

      I do not find these statements either in the Discussion or in the Results. I must reiterate my opinion that the applied methodical approach in the single cell transcriptomics part has severe limitations, and the readers must be aware of this.

      We thank the reviewer for this further comment. We understand and acknowledge the reviewer’s concerns regarding the methodological limitations of single-cell transcriptomic analyses, and we agree that these limitations should be clearly communicated to readers in the main text.

      We acknowledge that in the previous version of the manuscript, the exploratory nature of the single-cell transcriptomic analysis and its methodological boundaries were discussed only in the response to reviewers and were not explicitly stated in the manuscript itself. We thank the reviewer for pointing out this omission. In the revised manuscript, we have now added explicit clarifications in the main text to prevent potential overinterpretation of these results.

      In the present study, our primary effort is focused on the descriptive characterization of the three-dimensional anatomical organization and spatial relationships of the PLC using volumetric imaging and vascular labeling strategies. As a complementary exploratory analysis, we reanalyzed existing liver single-cell transcriptomic datasets to examine endothelial cell populations exhibiting PLC-associated features, and performed differential gene expression and Gene Ontology enrichment analyses. Importantly, these results are intended to provide molecular-level support for the structural identification of the PLC and to offer preliminary insights into its potential biological functions. Accordingly, we have narrowed the presentation and interpretation of the single-cell analysis in both the Results and Discussion sections of the revised manuscript.

      In addition, we have expanded the Discussion to address the limitations of current spatial transcriptomic approaches in validating a continuous three-dimensional structure such as the PLC. Most existing spatial transcriptomic methods rely on two-dimensional tissue sections of 8–10 μm thickness, whereas identification of the PLC depends on three-dimensional imaging of tissue volumes with thicknesses of ≥200 μm, making reliable reconstruction of its spatial continuity from single sections challenging. Furthermore, because each spatial transcriptomic capture spot often encompasses multiple adjacent cells, signal mixing effects further limit precise resolution of specific periportal microstructures.

      Overall, we agree with the reviewer’s central point that the limitations of single-cell transcriptomic analyses should be clearly understood by readers. By explicitly clarifying the methodological boundaries and refining the related statements in the main text, we believe this concern has now been adequately addressed in the revised manuscript. We thank the reviewer for identifying this omission, which has helped to improve the rigor and clarity of the study.

      Reviewer #3 (Recommendations for the authors):

      (1) While these are interesting observations suitable for discussion, the following sections are speculative, given that no functional characterization of PLC importance has been performed yet. This is most apparent in the comments on the role in hematopoiesis, which transiently takes place in the liver during embryogenesis (Khan et al., 2016) but ceases to exist after ligation of the umbilical inlet. Adult liver hematopoiesis remains controversial, and more solid evidence would need to be presented to support its existence in PLC regions.

      265 - These findings suggest that the Periportal Lamellar Complex (PLC) is not only a morphologically and spatially distinct, low-permeability vascular unit surrounding the portal vein, but also likely serves as a critical nexus connecting the portal vein, hepatic artery, and liver sinusoids. Thus, the PLC constitutes a key node within the interactive vascular network of the mouse liver.

      We thank the reviewer for the comments and suggestions regarding the potential functional interpretation of the periportal lamellar complex (PLC), particularly its possible association with hematopoietic function. We would like to clarify that the statement on line 265 was intended solely to describe the structural characteristics and spatial organization of the PLC within the periportal vascular network. Specifically, the original wording aimed to summarize the morphological features of the PLC and its spatial relationships among the portal vein, hepatic artery, and hepatic sinusoids.

      Nevertheless, to minimize potential misunderstanding, we have revised this section to avoid unnecessary functional implications. The revised text now reads:

      “These results suggest that the periportal lamellar complex (PLC) is a morphologically and spatially distinct vascular structure that surrounds the portal vein and may serve as a key organizational node coordinating the spatial relationships among the portal vein, hepatic artery, and hepatic sinusoids. Accordingly, the PLC represents an important structural element within the interactive vascular network of the mouse liver.”

      This revision preserves the structural significance of the PLC while avoiding overinterpretation of its functional roles.

      (2) The same is true for this section, following Figure 3: no functional experiment tested this. For example, diphtheria toxin could be expressed in the CD34+Sca1+ population, or at least a careful mapping of the developing liver could be performed, which would indicate whether the PLC precedes or follows bile duct (BD) development.

      356 as a spatial positional cue guiding bile duct growth and branching but also as a regulatory node involved in coordinating bile drainage from the hepatic lobule into the biliary network.

      To avoid potential misunderstanding, we have further refined and revised the statements in the manuscript regarding the functional interpretation of the periportal lamellar complex (PLC) and its relationship to bile duct development. We agree that cell ablation strategies are of great importance for functional validation studies. However, it should be noted that CD34 and Sca-1 are relatively broadly expressed markers during liver development, labeling multiple endothelial, mesenchymal, and progenitor cell populations, and their expression is not restricted to the PLC. Owing to this broad expression pattern, ablation of CD34<sup>+</sup>Sca-1<sup>+</sup> cell populations would likely exert widespread effects on vascular and stromal structures, thereby complicating the distinction between direct PLC-specific effects and secondary developmental alterations. As such, this strategy may present technical limitations for specifically dissecting the role of the PLC in bile duct development. At the same time, given that the primary objective of this study is the systematic characterization of the three-dimensional anatomical features and spatial organization of the PLC, we have correspondingly revised the manuscript to restrict statements regarding the relationship between the PLC and bile ducts to spatial associations supported by the current data. Specifically, our results show that primary bile ducts run along the main portal vein trunk, secondary bile ducts exhibit directed branching toward the PLC region, and terminal bile duct branches tend to spatially cluster in the vicinity of the PLC, thereby forming a reproducible periportal spatial arrangement. Based on these observations, the PLC delineates a relatively conserved anatomical microenvironment within the portal region, whose spatial position is closely associated with the organization and terminal distribution of the intrahepatic bile duct network.

      We believe that these revisions more accurately reflect the experimental evidence and the defined scope of the present study.

      (3) The following statement ought to be rephrased or skipped, considering that CD34 and Sca1 (Ly6a) are markers of periportal endothelial cells (Pietilä et al., 2025; Gómez-Salinero et al., 2022), as shown by the authors in their own Fig. 6D. In this context and the context of the CCl<sub>4</sub> experiments, a "simple" proliferative progenitor portal vein endothelial cell phenotype, suggested also by the presence of DLL4 (Fig. 5A) and JAG1 (Pietilä et al., 2025; Benedito et al., 2009), ought to be considered.

      409 Notably, CD34 and Sca-1 (Ly6a) were co-expressed exclusively within PLC structures surrounding the portal vein, but absent from central vein ECs and midzonal LSECs (Figure 4F).

      We thank the reviewer for pointing out the potential imprecision in this wording. We agree that both CD34 and Sca-1 (Ly6a) are well-established markers of periportal endothelial cells, as previously reported (Pietilä et al., 2025; Gómez-Salinero et al., 2022), and as also illustrated in Fig. 4F of our study.

      Accordingly, the original statement suggesting that CD34 and Sca-1 are co-expressed exclusively within the PLC structure may indeed represent an overinterpretation. Following the reviewer’s suggestion, we have revised the relevant text on line 409 by removing the exclusive phrasing (“only in”) and by emphasizing instead that CD34<sup>+</sup>Sca-1<sup>+</sup> endothelial cells are enriched in periportal regions associated with the PLC, rather than being specific to or confined within the PLC.

      In addition, in the context of the CCl<sub>4</sub>-induced liver fibrosis model, we agree with the reviewer that the observed expression of DLL4 and JAG1 under fibrotic conditions is more appropriately interpreted as reflecting an activated or proliferative periportal endothelial progenitor–like phenotype, rather than defining a novel endothelial lineage. The corresponding statements in the revised manuscript have been adjusted accordingly.

      (4) Again, these concluding sentences are based on correlative evidence of mRNA expression and literature but not experimental evidence.

      436 These findings suggest that this unique endothelial cell subset in the periportal region may possess dual regulatory functions in both metabolic and hematopoietic modulation

      441 results suggest that PLC endothelial cells may not only regulate periportal microcirculatory blood flow but also help establish a specialized microenvironment that potentially supports periportal hematopoietic regulation, contributing to stem cell recruitment, vascular homeostasis, and tissue repair.

      We thank the reviewer for this thoughtful comment. We agree that these statements are primarily based on transcriptomic correlation analyses and support from previous literature, rather than direct functional experimental evidence.

      Accordingly, in the revised manuscript, we have appropriately toned down and adjusted the relevant concluding statements to more accurately reflect their inferential nature. The revised wording emphasizes associations and potential involvement, rather than definitive functional roles. These changes preserve the overall scientific interpretation while aligning the level of inference more closely with the available evidence.

      The revised text now reads:

      “Finally, we found that the main trunk of the PLC is primarily composed of CD34<sup>+</sup>Sca-1<sup>+</sup>CD31<sup>+</sup> endothelial cells (Fig. 4J). These CD34<sup>+</sup>Sca-1<sup>+</sup> double-positive cells are mainly distributed in the basal region of the PLC structure and exhibit molecular features associated with hematopoiesis. Taken together, these results suggest that PLC endothelial cells may contribute to the establishment of a local microenvironment related to periportal hematopoietic regulation and may play potential roles in stem cell recruitment and maintenance of vascular homeostasis.”

      (5) The following part is speculative and based on re-analysis of a dataset that was gathered after 6 more weeks of CCl<sub>4</sub> treatment (12 weeks; Su et al., 2021) than in the linked experiments from the manuscript. It should be moved to the Discussion or removed.

      504 Moreover, single-cell transcriptomic re-analysis revealed significant upregulation of bile duct-related genes in the CD34<sup>+</sup>Sca-1<sup>+</sup> endothelium of PLC in fibrotic liver, with notably high expression of Lgals1 (Galectin-1) and Hgf (Figure 5G). Previous studies have shown that Galectin-1 is absent in normal liver parenchyma but highly expressed in intrahepatic cholangiocarcinoma (ICC), correlating with tumor dedifferentiation and invasion (Bacigalupo, Manzi, Rabinovich, & Troncoso, 2013; Shimonishi et al., 2001). Additionally, hepatocyte growth factor (HGF), particularly in combination with epidermal growth factor (EGF) in 3D cultures, promotes hepatic progenitor cells to form bile duct-polarized cystic structures (N. Tanimizu, Miyajima, & Mostov, 2007). Together, these findings suggest the PLC endothelium may act as a key regulator of bile duct branching and fibrotic microenvironment remodeling in liver fibrosis.

      Collectively, our results demonstrate that the PLC, situated between the portal vein and periportal sinusoidal endothelium, constitutes a critical vascular microenvironmental unit. It may not only colocalize with bile duct branches under normal physiological conditions, but also through its basal CD34<sup>+</sup>Sca-1<sup>+</sup> double-positive endothelial cells, potentially orchestrate bile duct epithelial proliferation, branching morphogenesis, and bile acid transport homeostasis via multiple signaling pathways. Particularly during liver fibrosis progression, the PLC exhibits dynamic structural extension, serving as a spatial scaffold facilitating terminal bile duct migration and expansion into the hepatic parenchyma (Figure 5H). These findings highlight the PLC endothelial cell population and the vascular-bile duct interface as key regulatory hubs in bile duct regeneration, tissue repair, and pathological remodeling, providing novel cellular and molecular insights for understanding bile duct-related diseases such as ductular reaction, cholangiocarcinoma, and cholestatic disorders, and offering potential targets for therapeutic intervention.

      We thank the reviewer for this careful and thought-provoking comment. We understand and agree with the reviewer’s assessment that this section involves a degree of inference, as the analysis is based on a re-analysis of a previously published single-cell transcriptomic dataset from a CCl<sub>4</sub>-induced liver fibrosis model (Su et al., 2021), rather than on experimental data directly generated in the present study.

      In response to the reviewer’s suggestion, we have carefully re-examined and revised the relevant paragraphs. Without altering the overall structure of the manuscript, we have appropriately moderated the wording to clarify that these results primarily describe the transcriptional features of PLC-associated CD34<sup>+</sup>Sca-1<sup>+</sup> endothelial cells under fibrotic conditions, and their associations with bile duct–related gene expression, rather than providing direct functional evidence for their roles in bile duct branching or microenvironmental remodeling.

      In addition, we have explicitly clarified in the main text the data source and methodological limitations of the single-cell transcriptomic analysis, and emphasized that these findings should be interpreted in conjunction with the spatial information revealed by three-dimensional imaging. Through these revisions, we aim to retain the value of this analysis in providing complementary molecular insight into PLC characteristics, while avoiding potential over-interpretation of its functional implications.

      Formal suggestions:

      (6) The following sentence would benefit from being more clearly written.

      263 - The formation of PLC structures in the adventitial layer may participate in local blood flow regulation, maintenance of microenvironmental homeostasis.

      We thank the reviewer for this helpful suggestion. The sentence has been revised to improve clarity by correcting the parallel structure and refining the wording.

      “The formation of PLC structures in the adventitial layer may participate in local blood flow regulation and the maintenance of microenvironmental homeostasis.”

      (7) The following sentence is misleading as it implies cell sorting, and "subsetted" rather than "sorted" should be used.

      414 Based on this, we sorted CD34<sup>+</sup>Sca-1<sup>+</sup> endothelial populations from the total liver EC pool (Figure 4G).

      Thank you for your comment.

      We have revised the term as suggested. This avoids the misleading implication of physical sorting, as our operation was analytical subsetting of the target subpopulation.

      We appreciate your careful review.
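      The distinction between analytical subsetting and physical sorting can be illustrated with a minimal sketch. It assumes a toy expression table and a simple "greater than zero" positivity threshold; the cell identifiers, expression values, and the function name `subset_double_positive` are hypothetical and are not taken from the manuscript's actual pipeline.

```python
# Hypothetical toy expression table: one dict per endothelial cell
# (values stand in for normalized expression; not real data).
cells = [
    {"id": "ec1", "Cd34": 2.1, "Ly6a": 1.4},
    {"id": "ec2", "Cd34": 0.0, "Ly6a": 3.0},
    {"id": "ec3", "Cd34": 1.2, "Ly6a": 0.0},
    {"id": "ec4", "Cd34": 0.8, "Ly6a": 0.5},
]


def subset_double_positive(cells, markers=("Cd34", "Ly6a"), threshold=0.0):
    """Return the cells whose expression exceeds `threshold` for every marker.

    This is an in-silico filter on a data table (subsetting); no cells are
    physically separated, which is what "sorting" would imply.
    The threshold of 0.0 is an assumption made for illustration.
    """
    return [c for c in cells if all(c.get(m, 0.0) > threshold for m in markers)]


double_positive = subset_double_positive(cells)
print([c["id"] for c in double_positive])
```

      Because the operation only filters rows of an existing dataset, the choice of threshold directly determines which cells are called double-positive, which is one reason such populations are best treated as operational.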

      (8) Correct typos, especially in the Results section related to Fig. 6, and formatting issues in the Discussion.

      730 Morphologically, the PLC shares features with previously described telocytes (TCs)- 731 a recently identified class of interstitial cells in the liver observed via transmission electron

      We thank the reviewer for pointing out this textual error. In the submitted version, the sentence describing the morphological similarity between the PLC and previously reported telocytes was inadvertently interrupted due to a punctuation issue. This has now been corrected to ensure sentence integrity and consistent formatting.

    1. eLife Assessment

      This study now provides solid evidence for a role of EndoA3-mediated trafficking of ICAM-1 to the immune synapse with T cells. The study will be valuable to those studying cell-cell communication in the immune system, and opens additional questions regarding the mechanisms involved and how other adhesion ligands are regulated.

    2. Reviewer #1 (Public review):

      Summary:

      This study by Xu et al. investigates how clathrin-independent endocytosis in cancer cells influences T cell activation. Using a combination of biochemical approaches and imaging, the authors identify ICAM1, the ligand for the T cell integrin LFA-1, as a novel cargo of EndoA3-mediated endocytosis.

      The authors then explore the functional consequences of EndoA3 depletion in cancer cells on T cell function using cytokine measurements, surface marker analyses, cytotoxicity assays and imaging. Loss of EndoA3 results in reduced T cell cytokine production, while expression of activation and exhaustion markers such as TIM-3, PD-1, and CD137 remains largely unchanged. EndoA3 knockout is associated with reduced ICAM1 surface levels and increased ALCAM levels in cancer cells. Imaging experiments further reveal directional transport of ICAM1 toward the immunological synapse, seemingly slightly reduced ICAM1 levels at the synapse upon EndoA3 depletion and an enlarged contact area between T cells and cancer cells.

      Based on these observations, the authors propose a model in which EndoA3-mediated endocytosis and retrograde trafficking of ICAM1 (and ALCAM) supplies the immunological synapse with ligands for adhesion molecules. In the absence of EndoA3, T cells are suggested to compensate for suboptimal ICAM1 availability by enlarging the synaptic contact area, altering synapse architecture, leading to reduced cytokine secretion but modestly enhanced cytotoxicity.

      Overall, the study provides convincing evidence for a modulatory role of EndoA3-mediated endocytosis in regulating T cell-cancer cell interactions. However, the choice of cellular model systems, the limited number of biological replicates and insufficiently supported mechanistic interpretations weaken the manuscript and the strength of its conclusions.

      Strengths:

      The authors employ a rigorous and innovative experimental strategy that convincingly identifies ICAM1 as a novel cargo of EndoA3-mediated endocytosis with convincing visualization of directional ICAM1 transport toward the immunological synapse. In addition, the study provides a comprehensive characterization of how EndoA3 depletion in cancer cells affects T cell cytokine production, activation, proliferation and cytotoxic function, representing a valuable contribution to our understanding of how membrane trafficking pathways in target cells can modulate immune responses.

      Comments on revised version:

      Thank you very much for submitting your revised manuscript. I appreciated your efforts to answer all of the reviewers' questions. While in my opinion the manuscript has truly improved, I think there are still lingering questions, in particular regarding the following points:

      (1) Limited biological replication:

      The LB33-MEL system remains problematic, as also noted by other reviewers. While it clearly represents an improvement over highly derived model systems such as Jurkat or Raji cells, it nevertheless effectively restricts the study to a single biological replicate. In this context, it may be more appropriate to compare the chosen approach to more state-of-the-art systems, such as expression of HLA-A*02:01, peptide loading (e.g. NY-ESO), and introduction of the matching TCR into donor-derived primary T cells. Such an approach would allow the use of multiple T cell donors and would substantially strengthen the generalizability of the conclusions.

      (2) Expression levels of ICAM1:

      Based on available database information (e.g. UniProt) and published literature (PMID: 9371813), ICAM1 appears to be expressed at relatively low levels in both HeLa and LB33-MEL cells. While the effects on T cells are initially discussed in terms of broader changes in EndoA3-mediated recycling of multiple surface proteins, including ICAM1 and ALCAM (and potentially others), the focus of the manuscript increasingly shifts toward ICAM1 as the primary driver of the observed phenotypes. Given the comparatively low endogenous expression of ICAM1 in the chosen model systems, it is unclear whether this emphasis is fully justified. In addition, if ICAM1 polarization toward the immunological synapse was assessed using ICAM1 overexpression, whereas other phenotypes (such as enlarged contact area) were analyzed under endogenous expression conditions, this further complicates the interpretation. As a first step toward clarifying these issues, it would be helpful to include representative flow cytometry histograms showing surface expression levels of ICAM1 and ALCAM, rather than only normalized quantifications.

      (3) Cell-cell contact dynamics:

      The manuscript suggests that altered contact dynamics may underlie the observed increase in cytotoxicity upon EndoA3 depletion. However, these claims are not directly tested. Such effects could be addressed with relatively straightforward experiments, for example by directly measuring T cell-cancer contact duration in co-culture assays.

    3. Reviewer #2 (Public review):

      The manuscript by Xu et al. studies the relevance of endophilin A3-dependent endocytosis and retrograde transport of immune synapse components in the activation of cytotoxic CD8 T cells. First, the authors show that ICAM1 and ALCAM, known components of immune synapses, are endocytosed via endoA3-dependent endocytosis and retrogradely transported to the Golgi. The authors then show that blocking internalization or retrograde trafficking reduces the activation of CD8 T cells. Moreover, this diminished CD8 T cell activation resulted in the formation of an enlarged immune synapse with reduced ICAM1 recruitment.

      Comments on revisions:

      The authors have addressed all my comments adequately.

    4. Reviewer #3 (Public review):

      Shiqiang Xu and colleagues have examined the importance of ICAM-1 and ALCAM internalization and retrograde transport in cancer cells on formation of a polarized immunological synapse with cytotoxic CD8+ T cells. They find that internalization is mediated by Endophilin A3 (EndoA3) while retrograde transport to the Golgi apparatus is mediated by the retromer complex. Perturbing these trafficking pathways reduces cytokine release, but increases cytolytic killing. The paper is building on previous findings from corresponding author Henri-François Renard showing that ALCAM is an EndoA3 dependent cargo in clathrin-independent endocytosis.

      The work is interesting as it describes a novel mechanism by which cancer cells might influence CD8+ T cell activation and immunological synapse formation, and the authors have used a variety of cell biology and immunology methods to study this. The authors have also made substantial efforts to address the reviewers' comments on the first version of the paper. However, there are still some points which could be further improved to underpin their conclusions:

      The movies and the related micrographs of EndoA3-mediated ICAM-1 endocytosis could be more convincing. Is the invagination of large membrane patches visible by volumetric imaging (e.g. confocal z-stacks) or brightfield microscopy?

      There is still a lack of quantitative evidence for polarized transport of ICAM-1 positive vesicles towards the immunological synapse. Only one example is shown and the authors state that the data is from a single movie representative of two independent experiments. If there are multiple cells per experiment, the number of cells should be stated and more examples should be included.

    5. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      This study by Xu et al. focuses on the impact of clathrin-independent endocytosis in cancer cells on T cell activation. In particular, by using a combination of biochemical approaches and imaging, the authors identify ICAM1, the ligand for T cell-expressed integrin LFA-1, as a novel cargo for EndoA3-mediated endocytosis. Subsequently, the authors aim to identify functional implications for T cell activation, using a combination of cytokine assays and imaging experiments.

      They find that the absence of EndoA3 leads to a reduction in T cell-produced cytokine levels. Additionally, they observe slightly reduced levels of ICAM1 at the immunological synapse and an enlarged contact area between T cells and cancer cells. Taken together, the authors propose a mechanism where EndoA3-mediated endocytosis of ICAM1, followed by retrograde transport, supplies the immunological synapse with ICAM1. In the absence of EndoA3, T cells attempt to compensate for suboptimal ICAM1 levels at the synapse by enlarging their contact area, which proves insufficient and leads to lower levels of T cell activation.

      Strengths:

      The authors utilize a rigorous and innovative experimental approach that convincingly identifies ICAM1 as a novel cargo for EndoA3-mediated endocytosis.

      Weaknesses:

      The characterization of the effects of EndoA3 absence on T cell activation appears incomplete. Key aspects, such as surface marker upregulation, T cell proliferation, integrin signalling and, most importantly, the killing of cancer cells, are not comprehensively investigated.

      We agree with the reviewer that the effects of EndoA3 depletion on T cell activation were insufficiently characterized. In new data presented in Fig. S4G-J, we explored additional activation markers and proliferation parameters. We did not observe any difference in the surface markers PD-1, CD137 and TIM-3 between LB33-MEL EndoA3+ cells treated with control and EndoA3 siRNAs. Regarding proliferation (Fig. S4J), although the proliferation index seems slightly lower upon EndoA3 depletion, we did not observe any significant difference either. Degranulation was also monitored (Fig. S4K), again without significant differences. In the new Fig. 3F, however, we performed chromium release assays to assess the killing of cancer cells. Very interestingly, we observed ~15% higher lysis of LB33-MEL EndoA3+ cells after EndoA3 depletion compared with the control condition, at a ratio of 3:1 T cells:target cells (where the maximal effect is observed). These data are further discussed in the Discussion section (new §6-9).
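      For readers less familiar with chromium release assays, the sketch below shows the conventional percent-specific-lysis calculation. The response does not state which formula was used, so this is an assumption; the function name and the counts in the usage example are hypothetical.

```python
def percent_specific_lysis(experimental, spontaneous, maximum):
    """Conventional chromium-release readout (assumed, not taken from the paper):

        100 * (experimental - spontaneous) / (maximum - spontaneous)

    experimental: counts released from targets incubated with T cells
    spontaneous:  counts released from targets alone
    maximum:      counts released after complete lysis (e.g. detergent)
    """
    if maximum <= spontaneous:
        raise ValueError("maximum release must exceed spontaneous release")
    return 100.0 * (experimental - spontaneous) / (maximum - spontaneous)


# Hypothetical counts per minute, for illustration only.
example = percent_specific_lysis(experimental=1500, spontaneous=500, maximum=2500)
print(example)
```

      Under this convention, a difference between conditions is a difference in percentage points of specific lysis, so it is worth stating whether a reported change is absolute or relative.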

      As endo- and exocytosis are intricately linked with the biophysical properties of the cellular membrane (e.g. membrane tension), which can significantly impact T-cell activation and cytotoxicity, the authors should discuss this possibility and ideally test it experimentally to some degree.

      Evaluating changes in the biophysical properties of cancer cell plasma membrane upon EndoA3 depletion is not trivial. An indirect way to address this question is by observing the area and shape of cells after siRNA treatment. In the new data added in the new Fig. S4B-D, we compared the area, aspect ratio and roundness of LB33-MEL EndoA3+ cells treated with negative control or EndoA3 siRNAs. While we observed a slight cell area reduction upon EndoA3 depletion, no significant changes were observed regarding the aspect ratio and the roundness. Hence, we think that the biophysical properties of cancer cells are not drastically modified by EndoA3 depletion.
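      The shape descriptors mentioned above can be made concrete with a short sketch. It assumes ImageJ-style definitions based on a fitted ellipse (aspect ratio = major axis / minor axis; roundness = 4 × area / (π × major axis²)); the response does not say which software or definitions were used, and the measurements below are hypothetical.

```python
import math


def aspect_ratio(major_axis, minor_axis):
    """Ratio of the fitted ellipse's major to minor axis (1.0 for a circle)."""
    return major_axis / minor_axis


def roundness(area, major_axis):
    """ImageJ-style roundness: 4 * area / (pi * major_axis**2).

    Equals 1.0 for a circle and decreases as the shape elongates.
    """
    return 4.0 * area / (math.pi * major_axis ** 2)


# Hypothetical cell approximated by an ellipse with semi-axes 20 and 10 (µm).
semi_major, semi_minor = 20.0, 10.0
area = math.pi * semi_major * semi_minor
print(aspect_ratio(2 * semi_major, 2 * semi_minor))
print(roundness(area, 2 * semi_major))
```

      These descriptors capture overall cell outline only; they are an indirect proxy and do not measure membrane tension itself.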

      Crucially, key literature relevant to this research, addressing the role of ICAM1 endocytosis in antigen-presenting cells, has not been taken into consideration.

      We thank the reviewer for this important point. We have now considered and cited the relevant literature (Discussion, page 9).

      Reviewer #2 (Public review):

      Summary:

      The manuscript by Xu et al. studies the relevance of endophilin A3-dependent endocytosis and retrograde transport of immune synapse components and in the activation of cytotoxic CD8 T cells. First, the authors show that ICAM1 and ALCAM, known components of immune synapses, are endocytosed via endoA3-dependent endocytosis and retrogradely transported to the Golgi. The authors then show that blocking internalization or retrograde trafficking reduces the activation of CD8 T cells. Moreover, this diminished CD8 T cell activation resulted in the formation of an enlarged immune synapse with reduced ICAM1 recruitment.

      Strengths:

      The authors show a novel EndoA3-dependent endocytic cargo and provide strong evidence linking EndoA3 endocytosis to the retrograde transport of ALCAM and ICAM1.

      Weaknesses:

      The role of EndoA3 in the process of T cell activation is shown in a cell that requires exogenous expression of this gene. Moreover, the authors claim that their findings are important for polarized redistribution of cargoes, but failed to show convincingly that the cargoes they are studying are polarized in their experimental system. The statistics of the manuscript also require some refinement.

      We fully acknowledge that the requirement for exogenous expression of EndoA3 in our immunological model represents a limitation of our study. Unfortunately, it remains challenging to identify cancer cell lines for which autologous CD8 T cells are available and that endogenously express all molecular players investigated (in particular EndoA3). At this stage, we do not have access to any other cancer cell line/autologous CD8⁺ T cell pairs that are sufficiently well characterized. In future studies, it would be valuable to investigate tumor types with high endogenous EndoA3 expression (such as glioblastomas, gliomas, and head and neck cancers) for which autologous CD8 T cells could be obtained, but this remains technically challenging.

      To address the reviewer’s second point regarding polarized redistribution of cargoes, we have added new data in the new Figure 4 and Movies S8-9. Using high-speed spinning-disk live-cell confocal microscopy, we captured the movement of ICAM1-positive tubulovesicular carriers in cancer cells at the moment of contact with CD8 T cells. Capturing such events is technically challenging, as T cell–cancer cell contacts form randomly and transiently. Successful imaging requires that the cancer cell be well spread and express ICAM1–GFP at an optimal level (as it is transiently expressed as a GFP-tagged construct), while acquisition must occur precisely at the moment when the T cell initiates contact. Despite these technical constraints, we successfully imaged early stages of immune synapse formation, enabling visualization of ICAM1 vesicular transport.

      The data reveal a flux of ICAM1-positive carriers emerging from the perinuclear region (corresponding to the Golgi area) and moving toward the contact site with the CD8 T cell, with fusion events of vesicles occurring at the developing immune synapse. AI-based segmentation and tracking analyses showed that ICAM1-positive carrier trajectories were predominantly oriented toward the forming immune synapse, whereas carriers moving toward other cellular regions were markedly less frequent. These results provide direct evidence for polarized ICAM1 transport via vesicular trafficking toward the immune synapse.

      Reviewer #3 (Public review):

      Summary:

      Shiqiang Xu and colleagues have examined the importance of ICAM-1 and ALCAM internalization and retrograde transport in cancer cells on the formation of a polarized immunological synapse with cytotoxic CD8+ T cells. They find that internalization is mediated by Endophilin A3 (EndoA3) while retrograde transport to the Golgi apparatus is mediated by the retromer complex. The paper is building on previous findings from corresponding author Henri-François Renard showing that ALCAM is an EndoA3-dependent cargo in clathrin-independent endocytosis.

      Strengths:

      The work is interesting as it describes a novel mechanism by which cancer cells might influence CD8+ T cell activation and immunological synapse formation, and the authors have used a variety of cell biology and immunology methods to study this. However, there are some aspects of the paper that should be addressed more thoroughly to substantiate the conclusions made by the authors.

      Weaknesses:

      In Figure 2A-B, the authors show micrographs from live TIRF movies of HeLa and LB33MEL cells stably expressing EndoA3-GFP and transiently expressing ICAM-1-mScarlet. The ICAM-1 signal appears diffuse across the plasma membrane while the EndoA3 signal is partially punctate and partially lining the edge of membrane patches. Previous studies of EndoA3-mediated endocytosis have indicated that this can be observed as transient cargo-enriched puncta on the cell surface. In the present study, there is only one example of such an ICAM-1 and EndoA3 positive punctate event. Other examples of overlapping signals between ICAM-1 and EndoA3 are shown, but these either show retracting ICAM1 positive membrane protrusions or large membrane patches encircled by EndoA3. While these might represent different modes of EndoA3-mediated ICAM-1 internalization, any conclusion on this would require further investigation.

      We agree with the reviewer that the pattern of cargoes during endocytosis (puncta vs large patches) as observed by live-cell TIRF microscopy may be confusing. In fact, we have observed a punctate pattern almost systematically when monitoring the uptake of endogenous cargoes via antibody uptake assays, regardless of the imaging approach (TIRF, spinning-disk, classical confocal or lattice light-sheet microscopy). For example:

      - ALCAM: Fig.1e-h, Supplementary Figure 5 and Supplementary Movies 1-3 and 6 in Renard et al. 2020, https://doi.org/10.1038/s41467-020-15303-y; Fig.1D and Movie 2 in Tyckaert et al. 2022, https://doi.org/10.1242/jcs.259623.

      - L1CAM: Fig.2 and 3D, Movies S1-4 in Lemaigre et al. 2023, https://doi.org/10.1111/tra.12883.

      In rare cases, we observed bigger clusters of antibodies, which EndoA3 surrounded and delineated in a “lasso-like” pattern while the clusters were progressively taken up:

      - ALCAM: Supplementary Movie 4 in Renard et al. 2020, https://doi.org/10.1038/s41467-020-15303-y.

      However, bigger patches of cargoes were more often observed when uptake was monitored using transient expression of GFP-/mCherry-tagged versions of cargoes. In these cases, EndoA3 was predominantly observed to delineate cargo patches in a “lasso-like” pattern, progressively trimming those patches and leading to endocytosis. For example:

      - L1CAM: Fig.3E, Movie S5-7 in Lemaigre et al. 2023, https://doi.org/10.1111/tra.12883.

      - We also observed this pattern with CD166-GFP (unpublished).

      The fact that we observed patches rather than punctate patterns upon transient expression of fluorescently tagged cargo constructs is likely due to the elevated expression level of the cargoes.

      Therefore, the patchy pattern observed for ICAM1 and ALCAM, transiently expressed in fusion with fluorescent proteins, and surrounded by EndoA3 in Fig.2A-B and old Movies S1-3, is not surprising. Of note, upon anti-ALCAM antibody uptake, we observed a more punctate pattern (Fig.2C), as previously described. Unfortunately, the lower quality of commercial anti-ICAM1 antibody did not allow us to proceed to uptake assays as for ALCAM.

      Regarding Fig.S2 and old Movies S4-5, we agree with the reviewer that these data may be misleading, as they represent phenomena happening at protrusions and contact zones between two adjacent cells. We have now replaced these images with other examples where we avoid contact zones (Fig.S2 and new Movies S5-7).

      These different patterns (patches vs dots) are still unexplained at the current stage, and may indeed represent different modes of endocytosis. We think these various patterns may depend on the abundance/expression level of cargoes and their degree of clustering. This will be investigated in future studies. Still, whatever the pattern, these data demonstrate and confirm the association between EndoA3 and cargoes (such as ICAM1 or ALCAM), even in the absence of antibodies.

      Moreover, in Figure 2C-E, uptake of the previously established EndoA3 endocytic cargo ALCAM is analyzed by quantifying total internal fluorescence in LB33-MEL cells of antibody labelled ALCAM following both overexpression and siRNA-mediated knockdown of EndoA3, showing increased and decreased uptake respectively. Why has not the same quantification been done for the proposed novel EndoA3 endocytic cargo ICAM-1? Furthermore, if endocytosis of ICAM-1 and ALCAM is diminished following EndoA3 knockdown, the expression level on the cell surface would presumably increase accordingly. This has been shown for ALCAM previously and should also be quantified for ICAM-1.

      As correctly pointed out by the reviewer, anti-ICAM1 antibody uptake assays would have been ideal. We tried to perform them many times. Unfortunately, none of the commercial antibodies we tested yielded satisfying results in uptake experiments. Either the labeling was too weak/non-specific, or the antibody was not effectively stripped from the cell surface by acid washes, i.e. the acid-wash conditions required for efficient stripping were too harsh for the cells to tolerate. We have tried other approaches using the same commercial antibody which do not require acid washes (loss of surface assays by FACS, or uptake assays using surface protein biotinylation) or based on insertion of an Alfa-tag in the extracellular part of ICAM1 by CRISPR-Cas9 and detection of ICAM1 with an anti-Alfa-tag nanobody (unpublished approach; collaboration with the lab of Prof. Leonardo Almeida-Souza, University of Helsinki, who developed the approach), but without success. However, we were more successful with the SNAP-tag-based approach to follow retrograde transport, for which the commercial anti-ICAM1 antibody worked properly. In Fig. 1F, we could show that retrograde transport of ICAM1 (and thus most likely its endocytosis step) was significantly decreased upon EndoA3 depletion in HeLa cells, indirectly demonstrating that ICAM1 is effectively an EndoA3-dependent cargo.

      Regarding the expectation that the surface level of ICAM1 should increase upon perturbation of EndoA3-mediated endocytosis, we agree with the reviewer that this would be a plausible result. However, it is not necessarily systematic, as the surface level of a protein cargo always results from a balance between its endocytosis, its recycling to the plasma membrane, and its lysosomal degradation. The flux of newly synthesized protein must also be taken into account. In addition, multiple endocytic mechanisms exist in parallel, and the perturbation of one mechanism (EndoA3-mediated CIE, here) may be partially compensated by others, as cargoes can often be taken up via multiple endocytic routes. Hence, an increased abundance at the cell surface is not guaranteed upon endocytosis perturbation. Nevertheless, we measured the cell surface level of both ICAM1 and ALCAM in LB33-MEL EndoA3+ cells treated with negative control or EndoA3 siRNAs (Fig. S4E-F). Only minor differences were observed.

      In Figure 4A the authors show micrographs from a live-cell Airyscan movie (Movie S6) of a CD8+ T cell incubated with HeLa cells stably expressing HLA-A*68012 and transiently expressing ICAM1-EGFP. From the movie, it seems that some ICAM-1 positive vesicles in one of the HeLa cells are moving towards the T cell. However, it does not appear like the T cell has formed a stable immunological synapse but rather perhaps a motile kinapse. Furthermore, to conclude that the ICAM-1 positive vesicles are transported toward the T cell in a polarized manner, vesicles from multiple cells should be tracked and their overall directionality should be analyzed. It would also strengthen the paper if the authors could show additional evidence for polarization of the cancer cells in response to T-cell interaction.

      A similar point was raised by reviewer #2. We have revised this section accordingly. In the new Fig. 4 and Movies S8-9, we replaced the live-cell Airyscan confocal data with high-speed spinning-disk confocal imaging data, enabling a more accurate analysis of polarized cargo redistribution at a higher temporal resolution.

      Using this approach, we captured the movement of ICAM1-positive tubulo-vesicular carriers in cancer cells at the moment of contact with CD8 T cells. Capturing such events is technically challenging, as T cell–cancer cell contacts form randomly and transiently. Successful imaging requires that the cancer cell be well spread and express ICAM1–GFP at an optimal level (as it is transiently expressed as a GFP-tagged construct), while acquisition must occur precisely at the moment when the T cell initiates contact. Despite these technical constraints, we successfully imaged early stages of immune synapse formation, enabling visualization of ICAM1 vesicular transport.

      The data reveal a flux of ICAM1-positive carriers emerging from the perinuclear region (corresponding to the Golgi area) and moving toward the contact site with the CD8 T cell, with fusion events of carriers occurring at the developing immune synapse.

      AI-based segmentation and tracking analyses showed that ICAM1-positive carrier trajectories were predominantly oriented toward the forming immune synapse, whereas carriers moving toward other cellular regions were markedly less frequent. These results provide direct evidence for polarized ICAM1 transport via vesicular trafficking toward the immune synapse.
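      As an illustration only, the kind of orientation analysis described above could be sketched as follows. This is not the authors' AI-based pipeline: the scoring rule (cosine between a track's net displacement and the direction to the synapse), the 0.5 cutoff, the function names, and all coordinates are hypothetical assumptions chosen for the example.

```python
import math

def directionality(track, target):
    """Cosine of the angle between a track's net displacement and the
    direction from its first point to the target (+1 = straight toward,
    -1 = straight away). Returns 0.0 for a track with no net movement."""
    (x0, y0), (x1, y1) = track[0], track[-1]
    dx, dy = x1 - x0, y1 - y0                      # net displacement
    tx, ty = target[0] - x0, target[1] - y0        # start -> target
    norm = math.hypot(dx, dy) * math.hypot(tx, ty)
    if norm == 0:
        return 0.0
    return (dx * tx + dy * ty) / norm

def fraction_toward(tracks, target, min_cos=0.5):
    """Fraction of tracks whose directionality score is at least min_cos
    (a hypothetical cutoff, roughly within 60 degrees of the target)."""
    scores = [directionality(t, target) for t in tracks]
    return sum(s >= min_cos for s in scores) / len(scores)

# Hypothetical carrier tracks (x, y in micrometres) and synapse position.
synapse = (10.0, 0.0)
tracks = [
    [(0.0, 0.0), (2.0, 0.0), (5.0, 0.0)],   # moves toward the synapse
    [(0.0, 5.0), (3.0, 3.0), (6.0, 2.0)],   # moves toward the synapse
    [(5.0, 5.0), (5.0, 8.0)],               # moves away
    [(2.0, 2.0), (0.0, 2.0)],               # moves away
]
print(fraction_toward(tracks, synapse))
```

      Scoring net displacement keeps the sketch short; a real analysis would typically pool tracks from several cells and compare the score distribution against a non-contact region.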

      Finally, in Figures 4D-G, the authors show that the contact area between CD8+ T cells and LB33-MEL cells is increased in response to siRNA-mediated knockdown of EndoA3 and VPS26A. While this could be caused by reduced polarized delivery of ICAM-1 and ALCAM to the interface between the cells, it could also be caused by other factors such as increased cell surface expression of these proteins due to diminished endocytosis, and/or morphological changes in the cancer cells resulting from disrupted membrane traffic. More experimental evidence is needed to support the working model in Figure 4H.

      Regarding the cell surface expression of both ICAM1 and ALCAM, as already explained above, only minor differences were observed (Fig. S4E-F). Regarding morphological changes of cancer cells upon EndoA3 depletion (Fig. S4B-D), we compared the area, aspect ratio and roundness of LB33-MEL EndoA3+ cells treated with negative control or EndoA3 siRNAs. While we observed a slight cell area reduction upon EndoA3 depletion, no significant changes were observed regarding the aspect ratio and the roundness. Cancer cell morphology is thus not drastically modified by EndoA3 depletion. All these new data are now discussed in the manuscript.

      Recommendations for the authors:

      Reviewing Editor Comments:

      The reviewers discussed the paper and all agreed it was incomplete in supporting the conclusions. Additional data needed to support the conclusions were:

      (1) Better characterisation of EndoA3-expressing and knock-down cells such as morphology, ICAM-1, and ALCAM surface levels to name two parameters.

      As discussed above, we have now added new data addressing these points:

      - Morphology: Fig. S4B-D

      - ICAM1 and ALCAM surface levels: Fig. S4E-F

      These new data are discussed in the main text.

      (2) Better characterisation of the ICAM-1 polarisation process. Does this require interaction with LFA-1, or can ICAM-1 be delivered to the synapse without it?

      As discussed above, we have now added new data that better characterize ICAM1 polarized trafficking to the immune synapse; these can be found in the new Fig. 4 (high-speed spinning-disk confocal imaging of ICAM1 trafficking upon conjugate formation between CD8 T cell and cancer cell). The text has been modified accordingly. The dependency on LFA-1 has not been addressed directly, but we may suppose it is indeed important as (i) it has already been addressed in other cellular systems by previous studies (Jo et al. 2010), and (ii) we observed a denser flux of ICAM1-positive carriers in the cancer cell toward regions involved in immune synapses with CD8 T cells than toward other regions. As we did not address this question more directly in our study, we briefly mention this point in the Discussion section.

      (3) Better characterisation of T cell response: activation markers, cytotoxicity assays.

      As discussed above, we have now added new data addressing these points:

      - Cell surface activation markers: Fig. S4G-I

      - Proliferation: Fig. S4J

      - Degranulation: Fig. S4K

      - Cytotoxic activity: Fig. 3F

      These new data are discussed in the main text.

      (4) Citing relevant literature.

      The relevant literature (in particular the paper by Jo et al. 2010) is now cited and discussed.

      (5) Number of donors evaluated - is it true there was only one blood donor? For human studies better to have key results on >4 donors.

      Our immunological working model indeed originates from a single patient (Baurain et al., 2000), from whom both a cancer cell line (LB33-MEL) and autologous CD8 T cells were derived. These CD8 T cells specifically recognize an HLA molecule presenting a defined antigenic peptide (MUM-3) on the surface of the cancer cells. This provides us with a unique and fully natural experimental system that allows us to faithfully reconstitute cytotoxic T lymphocyte (CTL)-mediated killing of cancer cells in vitro.

      Using CD8 T cells from other donors would not be meaningful in this context, as they would not recognize the LB33-MEL cells. Conversely, testing the same CD8 T cells on other cancer cell lines requires engineering these lines to express the appropriate HLA molecule and to be exogenously pulsed with the correct antigenic peptide – which is precisely what we did with the HeLa cell line.

      Therefore, increasing the number of donors would require obtaining both cancer cell lines and CD8 T cells from each donor, ideally with evidence that the donor’s T cells recognize their own tumor cells. This is technically challenging and not trivial, although it would indeed be highly valuable to diversify immunological models in future studies.

      Importantly, the high specificity of our autologous co-culture system, where cancer cells interact with their naturally matched CD8 T cells, offers clear advantages over commonly used in vitro models such as Jurkat (T) and Raji (B) cell lines, which rely on artificial stimulation with a superantigen to enforce immunological synapse formation and T cell activation.

      (6) How does the binding of antibodies to ICAM-1 and ALCAM impact their trafficking?

      As IgG antibodies are bivalent and can bind two target antigens, they may induce clustering, which could in turn affect endocytosis. To address this concern, we performed an uptake assay based on surface protein biotinylation using a cleavable biotin reagent (with a reducible linker). Briefly, after allowing endocytosis for different time intervals, cell surface–exposed biotins were removed by treatment with the cell-impermeable reducing agent MESNA, while internalized (endocytosed) biotinylated proteins remained protected. These internalized proteins were then recovered by affinity purification on streptavidin resin and analyzed by Western blot to detect the protein of interest.

      Importantly, this uptake assay can be performed in the absence or presence of an anti-cargo antibody, allowing assessment of its potential influence on endocytosis. Author response image 1 shows the results for ALCAM uptake in HeLa cells, with and without anti-ALCAM antibody:

      Author response image 1.

      Antibody binding to an extracellular epitope of ALCAM increases its endocytosis. HeLa cell-surface proteins were biotinylated on ice using EZ-Link Sulfo-NHS-SS-Biotin (Pierce) and then incubated at 37 °C for the indicated times to allow endocytosis. Internalization was assessed in the absence or presence of an anti-ALCAM antibody (Ab) added to the extracellular medium. Endocytosis was stopped by returning the cells to ice, and surface-exposed biotin was removed by treatment with the cell-impermeable reducing agent MESNA. Internalized, MESNA-resistant biotinylated proteins were affinity-purified on streptavidin resin and analyzed by Western blot to detect ALCAM. The “unstripped” condition shows the total amount of ALCAM at the cell surface at the beginning of the experiment (signal at ~95 kDa). Quantification of the time course (normalized to the no-antibody condition) shows increased ALCAM endocytosis in the presence of antibody at 15 and 30 min. Blot is representative of two independent experiments; quantifications include data from both experiments.

      We observed that the anti-ALCAM antibody slightly enhanced ALCAM uptake. A similar experiment was attempted for ICAM1, but we were unable to detect the protein by Western blot using the available commercial antibody.

      Although this outcome was expected, it highlights a potential caveat in using antibodies to monitor endocytosis. Alternative tools such as nanobodies, while monovalent and theoretically less perturbing, are not yet available for many cargo proteins and may still influence cargo conformation or dynamics. Therefore, antibodies remain the current gold standard in endocytosis studies. Nevertheless, data obtained with antibodies should always be validated by complementary approaches that do not rely on antibody binding, as we have done in this study (e.g. live-cell imaging of fluorescently tagged proteins).

      The work is of interest and we look forward to your response/revision.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      Thank you for submitting your manuscript which I had the pleasure to review. While I enjoyed your work, I feel that it would strongly benefit by addressing the following points:

      (1) In-depth characterization of T cell responses upon EndoA3 depletion: The characterization should be expanded to include surface marker upregulation, T cell proliferation, and, most importantly, tumor cell cytotoxicity. I was wondering if the incomplete characterization of T-cell responses is due to limited supplies of antigen-specific T-cells? My understanding is that these cells have been derived from a single patient. This also raises concerns in terms of reproducibility as all data are practically from a single biological replicate. My suggestion would be to use an additional system of specific cell-cell contacts to complement the current findings. For instance, HeLa cells could be transfected to express CD19 or EpCAM, for both of which bispecific T cell engagers (Invivogen) exist that would allow specific contact formation, thereby allowing the study of the effect of EndoA3 depletion across T cells from different donors and through a more complete set of assays.

      We refer the reviewer to our responses above, where these points have been addressed in detail. We sincerely thank the reviewer for the excellent suggestion of transfecting HeLa cells with CD19 or EpCAM and using bispecific T-cell engagers. However, after careful consideration, we concluded that this approach falls outside the scope of the present study, which was specifically designed to investigate the most natural system, cancer cells and their autologous CD8 T cells. We nevertheless appreciate this insightful suggestion and will certainly consider it for future studies.

      (2) Alterations in membrane tension as an alternative explanation: Endo- and exocytosis have been found to influence the biophysical properties of cells, such as membrane tension (e.g., Djakbaravo et al., 2021, PMID: 33788963), which in turn influences their susceptibility to cytotoxic T cells with lower tension corresponding to reduced cytotoxicity (e.g., Basu & Whitlock, 2016, PMID: 26924577). Thus, interference with endocytic pathways could arguably lead to changes in membrane tension that could contribute to the observed effects. These possible effects should be discussed and addressed experimentally to a degree. While measuring membrane tension directly requires specialized expertise (e.g., tether pulling experiments) and is not within the scope of this study, membrane tension affects cell spreading and actin organization. Thus, I would suggest conducting a thorough comparative phenotypical and morphological characterization of the EndoA3+ and EndoA3- cancer cells to estimate the possible effect of changes in membrane tension (if any) on the results.

      We refer the reviewer to our responses above, where these points have been addressed in detail. New data have been added and the text of our manuscript has been modified accordingly.

      (3) Citation and consideration of earlier work: Jo & Kwon et al., 2010 (PMID: 20681010) have previously shown that ICAM1 undergoes clathrin-independent recycling and repolarization to the immunological synapse in APCs. Furthermore, they provided evidence that actin-based transport, but not lateral diffusion, together with recycling is crucial for the repolarization of ICAM1 to the immunological synapse. This important earlier work has to be cited. Actin-based transport on the cell surface has not been considered in the current manuscript. In light of these earlier findings, it is unclear in Figure 4A if ICAM1 is delivered to the T cell from within- or from the surface of the cancer cell. I would suggest changing the imaging modalities in this experiment to be able to differentiate cell surface from internal ICAM1, e.g., by detaching the cancer cells from the surface as has been done in Fig. 4B, E, and F.

      We refer the reviewer to our responses above, where these points have been addressed in detail. New data have been added and the text of our manuscript has been modified accordingly.

      Reviewer #2 (Recommendations for the authors):

      Major comments:

      (1) The authors should be more careful with their claims about the importance of their results for cell polarity as their evidence for this is scarce (i.e. The live-cell imaging in Figure 4A is not quantified and the ICAM1 polarization effect shown in figure 4B-C is, albeit significant, small and not very convincing).

      We refer the reviewer to our responses above, where these points have been addressed in detail. New data have been added and the text of our manuscript has been modified accordingly.

      (2) The absence (or very low expression) of EndoA3 on the LB33-MEL cell suggests that EndoA3-mediated recycling of immune synaptic components is not required for T-cell activation. The fact that EndoA3 exogenous expression in LB33-MEL cells leads to increased cytokine production in T cells is, however, interesting.

      We fully agree with the reviewer’s observation. Although EndoA3 is not expressed in some cellular contexts, its cargoes may still be present. It is therefore reasonable to assume that alternative endocytic mechanisms can compensate for its absence. It is now widely accepted that many cargoes can be internalized through multiple endocytic routes, and that the relative contribution of each pathway depends strongly on the cellular and physiological context.

      For example, we have shown that although ALCAM and L1CAM are primarily internalized via clathrin-independent pathways, a minor fraction (< 25%) undergoes clathrin-mediated endocytosis (Renard et al., 2020; Lemaigre et al., 2023). Moreover, we observed that inhibition of macropinocytosis enhances EndoA3-mediated endocytosis of ALCAM, indicating a crosstalk between specific EndoA3-mediated clathrin-independent endocytosis (CIE) and non-specific macropinocytosis (Tyckaert et al., 2022).

      Thus, even in the absence of EndoA3, its cargoes are likely internalized through alternative endocytic routes. Nonetheless, our data clearly demonstrate that EndoA3 expression markedly enhances the endocytosis and intracellular trafficking of its cargoes, ultimately leading to modified CD8 T cell responses.

      (3) For the statistics in bar graphs (graphs 1C, D, E &F; 3E, 3F, S1C-I, and S3C), one cannot have all values for controls simply normalized to 1. This procedure hides the variance for the controls between each replicate and makes any statistics meaningless.

      We thank the reviewer for this important remark. Regarding Figures 1C–F, S1C–I, and S3C, which correspond to quantifications from Western blots, it is standard practice to normalize the quantification to a control condition set to 1 (or 100%). Absolute signal intensities cannot be directly compared across different blots due to the variability inherent to this semi-quantitative technique. For this reason, we chose to keep the data presented in normalized form. However, we agree that this type of data requires a carefully chosen statistical approach. We therefore used one-sample t-tests, which test the hypothesis that each siRNA condition differs from 100% (the normalized value of the siCtrl condition). We adapted the statistical analysis accordingly in the different figures mentioned.
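      For readers unfamiliar with this test, a minimal standard-library sketch of a two-sided one-sample t-test against a fixed reference of 100% is given here. The replicate values are hypothetical and are not taken from the manuscript; the function names are illustrative, and in practice a statistics package (for example `scipy.stats.ttest_1samp`) would normally be used instead of the numerical integration shown.

```python
import math
import statistics

def t_pdf(x, df):
    """Probability density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def one_sample_t(values, ref, steps=2000):
    """Two-sided one-sample t-test of `values` against the fixed value `ref`.
    Returns (t statistic, degrees of freedom, p-value). Requires at least
    two replicates with non-zero spread. `steps` must be even."""
    n = len(values)
    se = statistics.stdev(values) / math.sqrt(n)   # standard error of the mean
    t = (statistics.mean(values) - ref) / se
    df = n - 1
    # Simpson's rule for the integral of the density from 0 to |t|.
    h = abs(t) / steps
    area = t_pdf(0.0, df) + t_pdf(abs(t), df)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area *= h / 3
    p = max(0.0, 1 - 2 * area)                     # both tails
    return t, df, p

# Hypothetical band intensities for one siRNA condition, each expressed as a
# percentage of the siCtrl lane on the same blot (siCtrl = 100 by construction).
knockdown_pct = [62.0, 71.0, 55.0]
t, df, p = one_sample_t(knockdown_pct, 100.0)
print(f"t = {t:.2f}, df = {df}, p = {p:.4f}")
```

      Because the control is fixed at 100% in every replicate, it contributes no variance of its own, which is why the comparison is made against a constant rather than with a two-sample test.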

      Regarding old Figures 3E–F (now Fig. 3E and 3G), which correspond to IFNγ secretion assays, we agree that representing IFNγ secretion as a fold change relative to a control condition may obscure inter-experimental variability. However, this format was intentionally chosen to facilitate data interpretation, as IFNγ secretion was quantified by ELISA and also displayed inter-experimental variability. For completeness, we now provide below the corresponding graphs showing absolute IFNγ concentrations, which retain the information on inter-experimental variability (Author response image 2). As you can see, the overall conclusions remain unchanged.

      Author response image 2.

      IFNγ secretion data corresponding to Fig. 3E and 3G, expressed in absolute values (pg/mL).

      Minor comments:

      (1) What happens to surface and total levels of ICAM1 and ALCAM in the retromer or EndoA3 knockdown/overexpression conditions? This information would put the effects described into context.

      We refer the reviewer to our responses above, where these points have been addressed in detail. New data have been added and the text of our manuscript has been modified accordingly.

      (2) The authors should clearly indicate that BFA means bafilomycin A in the figure legend or methods.

      BFA corresponds to Brefeldin A. We have now clarified this information in legends and methods.

      (3) In the sentence: "These data demonstrate that retromer-mediated retrograde transport is critical for trafficking ALCAM and ICAM1 to the Golgi and that this process requires the full secretory capacity of the TGN." What do the authors mean by full secretory capacity?

      We have modified the sentence: “Together, these data demonstrate that retromer-mediated retrograde transport is critical for trafficking ALCAM and ICAM1 to the Golgi and that this process requires efficient secretion from the TGN (as evidenced by the involvement of Rab6).”

      (4) The method used for retrograde transport seems to be a variation of the original protocol (reference 43). The manuscript would benefit from a thorough explanation of this assay, rather than citing the original protocol.

      We did not modify the original SNAP-tag–based protocol used to monitor retrograde transport. A comprehensive methodological paper has been published (ref. 44), and we have followed it strictly. Additionally, we briefly summarized the rationale of the approach in Figure 1A and in the first paragraph of the Results section.

    1. eLife Assessment

      This important study investigates how infestation by the small brown planthopper (Laodelphax striatellus) reshapes rice carbohydrate allocation and demonstrates that host-derived glucose enhances insect fecundity and imidacloprid tolerance, through the activation of conserved nutrient-sensing and endocrine pathways. Across extensive and complementary approaches, including plant manipulations, glucose supplementation, RNAi, pharmacological inhibition, rescue experiments, and biochemical assays, the authors provide convincing evidence that glucose activates the TOR-juvenile hormone-vitellogenin axis to promote reproduction and co-regulates GST-mediated detoxification via both TOR-JH signaling and GCL-GSH metabolism. The mechanistic framework is coherent and well supported by hierarchical validation and functional assays. Some limitations remain regarding the generality of the findings across other pest species and insecticides, and aspects of the evolutionary framing would benefit from more cautious interpretation; nonetheless, the work substantially advances our understanding of how plant-derived nutrients interface with conserved insect signaling pathways to shape fitness-related traits, and will be of broad interest to researchers studying plant-insect interactions, insect physiology, and pest management.

    2. Reviewer #1 (Public review):

      Summary:

      The authors investigate how infestation of rice plants by the small brown planthopper (Laodelphax striatellus), an important pest in rice cultivation, alters host plant carbohydrate metabolism and how these changes affect insect physiology and fitness. They show that planthopper infestation leads to a density-dependent increase in glucose levels in rice plants, which the authors suggest results from a redistribution of carbohydrates from roots to shoots. Elevated glucose levels in plants are reflected by increased glucose contents in the insects themselves, an effect that is particularly pronounced in gravid females and associated with enhanced fecundity.

      In addition, the authors demonstrate that increased glucose availability enhances tolerance of the small brown planthopper to the neonicotinoid insecticide imidacloprid. These findings suggest that insect-mediated changes in plant carbohydrate allocation may benefit insect fitness in multiple ways, including increased reproductive output and enhanced tolerance to insecticides, both of which are relevant for understanding insect population dynamics in agroecosystems.

      Beyond these physiological observations, the authors aim to elucidate the underlying molecular mechanisms. They propose that glucose functions not only as a nutritional resource but also as a signaling molecule. Specifically, they show that increased glucose availability is associated with activation of the Target Of Rapamycin (TOR) pathway, a conserved nutrient-sensing signaling pathway regulating growth and metabolism across eukaryotes. Activation of TOR signaling is linked to increased juvenile hormone levels, which in turn stimulate vitellogenesis and likely contribute to increased fecundity. Furthermore, elevated juvenile hormone levels are associated with increased expression of glutathione S-transferases, suggesting a mechanism contributing to enhanced detoxification capacity. Independent of this pathway, increased glucose availability also leads to higher expression of glutamate-cysteine ligase, the rate-limiting enzyme in glutathione synthesis. Together, these mechanisms provide a non-exclusive explanation for the observed increase in imidacloprid tolerance and form the basis of the authors' proposed mechanistic framework linking glucose availability to reproduction and detoxification.

      Strengths:

      A major strength of the manuscript is its substantial mechanistic depth and the extensive use of complementary experimental approaches that converge on a coherent mechanistic interpretation. The authors combine plant manipulations, dietary supplementation, injection assays, RNAi-mediated gene silencing, pharmacological inhibition, and rescue experiments to systematically test the role of glucose as a signaling molecule linking plant-derived nutrition to insect reproduction and insecticide tolerance. Results obtained from independent experimental strategies are highly consistent, and the different datasets collectively support the central conclusions of the study.

      The role of glucose is supported by multiple lines of evidence demonstrating that increased glucose availability, whether induced by prior planthopper feeding, dietary supplementation, or direct injection, consistently results in elevated glucose levels in insects, increased oviposition, and enhanced expression of vitellogenesis-related genes (LsVg and LsVgR). The specificity of this effect is further strengthened by experiments using alternative carbohydrates that release glucose upon enzymatic cleavage, as well as inhibitor and rescue experiments, supporting the interpretation that glucose acts beyond a purely nutritional role.

      The authors further establish a mechanistic link between glucose availability, TOR signaling, juvenile hormone regulation, and vitellogenesis. Activation of TOR signaling by glucose, demonstrated at the level of protein phosphorylation, together with RNAi knockdown and pharmacological inhibition, allows causal placement of TOR upstream of juvenile hormone signaling. Consistent reductions in juvenile hormone titers, vitellogenesis-related gene expression, and oviposition following TOR inhibition, as well as rescue of reproductive output by juvenile hormone analog treatment, provide strong functional support for a glucose-TOR-juvenile hormone axis regulating fecundity. The absence of additive effects following combined knockdown of TOR and juvenile hormone synthesis components further supports the interpretation that these factors act within the same signaling cascade.

      Similarly, the authors provide a detailed mechanistic analysis of glucose-mediated effects on imidacloprid tolerance. Functional assays demonstrate that glutathione S-transferases contribute to detoxification in this species and that increased glucose availability enhances GST activity, glutathione synthesis, and overall glutathione levels. Transcriptomic analyses and targeted RNAi experiments further identify specific GSTs contributing to insecticide tolerance and indicate that glucose enhances detoxification through both TOR-dependent and TOR-independent mechanisms. The combined knockdown experiments, which produce additive effects on mortality, provide particularly strong support for the involvement of multiple interacting glucose-dependent pathways.

      Weaknesses:

      While I am impressed by the mechanistic depth of the study and the clarity with which the authors dissect the underlying physiological pathways, I am less convinced by the current conceptual framing of the phenomenon as a sophisticated adaptive strategy "co-opted" by the small brown planthopper. The data convincingly demonstrate that glucose availability activates conserved nutrient-sensing and endocrine pathways, including TOR signaling and juvenile hormone regulation, which in turn affect reproduction and detoxification capacity. However, these pathways are deeply conserved and likely operate in many insects in response to nutritional status. As such, the results may reflect a general physiological response to elevated carbohydrate availability rather than a species-specific, evolved strategy. Relatedly, herbivory-induced changes in plant carbohydrate allocation appear to be relatively common across plant-insect systems, and it would be helpful to discuss how specific (or general) the observed phenomenon is likely to be.

      In particular, I encourage the authors to more clearly distinguish between (i) a conserved nutrient-responsive signaling cascade and (ii) an adaptive mechanism that evolved specifically under selection imposed by insecticide exposure. The presented data strongly support the former interpretation, whereas evidence for the latter is less clear. The increased tolerance to imidacloprid appears to arise as a consequence of enhanced metabolic and detoxification capacity under elevated glucose conditions, rather than as a trait shaped directly by insecticide-driven selection. Framing this phenomenon as an adaptation to insecticide stress may therefore overextend the conclusions that can be drawn from the data. A more cautious discussion acknowledging that glucose-mediated activation of conserved metabolic and endocrine pathways may incidentally enhance insecticide tolerance, without necessarily having evolved under insecticide selection, would strengthen the conceptual clarity of the manuscript.

    3. Reviewer #2 (Public review):

      Summary:

      Zhang and colleagues investigate the molecular mechanisms by which the small brown planthopper (SBPH, Laodelphax striatellus) manipulates host rice carbohydrate metabolism to enhance its own fitness. Using a combination of molecular, pharmacological, and biochemical approaches, they demonstrate that SBPH infestation induces systemic glucose reallocation in rice, as evidenced by the upregulation of glucose levels in aerial tissues and a simultaneous reduction in root glucose levels. Notably, host-derived glucose acts as a central signaling molecule, driving two key adaptive traits: enhanced fecundity via the glucose-TOR-JH-Vg signaling cascade, and increased imidacloprid tolerance through synergistic metabolic (GCL-GSH) and regulatory (TOR-JH-GST) pathways targeting GST activity. These findings uncover a sophisticated resource-manipulation strategy in SBPH and identify nutrient-sensing and detoxification pathways as potential targets for pest control.

      Strengths:

      (1) The study addresses a gap in plant-insect coevolution research by identifying glucose as a dual-function signaling molecule that coordinates SBPH reproduction and insecticide tolerance, providing valuable insights into how herbivores exploit host nutritional signals.

      (2) The experimental design is well structured and multifaceted, integrating RNAi, RT-qPCR, Western blotting, pharmacological inhibition, and biochemical assays. The use of appropriate controls (e.g., osmotic controls with mannitol and hydrolase-inhibitor rescue experiments) strengthens the causal interpretation of the results.

      (3) The mechanistic framework is clear and well-supported. The authors delineate two interconnected molecular cascades (glucose-TOR-JH-Vg for fecundity and GCL-GSH/TOR-JH-GST for tolerance) with hierarchical validation (e.g., rescue experiments with JHA), ensuring the reliability of conclusions.

      Weaknesses:

      (1) The study focuses exclusively on SBPH without validating whether the observed phenomena and mechanisms are conserved in closely related planthopper species (e.g., brown planthopper Nilaparvata lugens). This limitation restricts the generalizability of the findings to other economically important rice pests.

      (2) The specific upstream signals that trigger glucose reallocation in rice (e.g., SBPH salivary effectors or oviposition-associated factors) are not identified. Although this represents a complex and independent research direction, the absence of such information limits the depth and completeness of the mechanistic framework and leaves open questions regarding the initiation of host metabolic manipulation.

      (3) Insecticide tolerance assays are limited to imidacloprid. Extending these analyses to one or two additional commonly used insecticides (e.g., thiamethoxam) would help determine whether the glucose-mediated detoxification pathway is specific to imidacloprid or reflects a broader resistance mechanism, thereby strengthening conclusions regarding the generality of the GST activation cascade.

      (4) Given the study's potential implications for pest management, the manuscript would benefit from a brief discussion of possible practical applications, such as manipulating rice glucose metabolism through breeding strategies or developing small-molecule inhibitors targeting the TOR-JH axis. Including such perspectives would enhance the translational relevance of the work by linking mechanistic insights to real-world pest control strategies.

    1. eLife Assessment

      This manuscript presents a valuable investigation of the peptidoglycan (PG) recycling pathway in Caulobacter crescentus. The authors showed that PG recycling in C. crescentus is essential not only for β-lactam (ampicillin) resistance but also for cell morphology, efficient division, and overall fitness. The study is comprehensive and compelling.

    2. Reviewer #1 (Public review):

      Summary:

      In their manuscript, Richter and colleagues comprehensively investigate the cell wall recycling pathway in the model alphaproteobacterium Caulobacter crescentus using biochemical, imaging, and genetic approaches. They clearly demonstrate that this organism encodes a functional peptidoglycan recycling pathway and demonstrate the activities of many enzymes and transporters within this pathway. They leverage imaging and growth assays to demonstrate that mutants in peptidoglycan recycling have varying degrees of beta-lactam sensitivity as well as morphological and cell division defects. They propose that, rather than impacting the levels or activity of the major beta-lactamase, BlaA, defects in PG recycling lead to beta-lactam sensitivity by limiting the availability of new cell wall precursors. The findings will be of interest to those in the field of bacterial cell wall biochemistry, antibiotics and antibiotic resistance, and bacterial morphogenesis.

      Strengths:

      Overall the manuscript is laid out logically, and the data are comprehensive, quantitative, and rigorous. The mutants and their phenotypes will be a valuable resource for Caulobacter researchers, and the findings may be relevant to cell wall recycling in other organisms.

      Weaknesses:

      No major weaknesses are noted.

      Comments on revisions:

      The authors addressed all of our concerns with the initial submission.

    3. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      In their manuscript, Richter and colleagues comprehensively investigate the cell wall recycling pathway in the model alphaproteobacterium Caulobacter crescentus using biochemical, imaging, and genetic approaches. They clearly demonstrate that this organism encodes a functional peptidoglycan recycling pathway and demonstrate the activities of many enzymes and transporters within this pathway. They leverage imaging and growth assays to demonstrate that mutants in peptidoglycan recycling have varying degrees of beta-lactam sensitivity as well as morphological and cell division defects. They propose that, rather than impacting the levels or activity of the major beta-lactamase, BlaA, defects in PG recycling lead to beta-lactam sensitivity by limiting the availability of new cell wall precursors. The findings will be of interest to those in the field of bacterial cell wall biochemistry, antibiotics and antibiotic resistance, and bacterial morphogenesis.

      Strengths:

      Overall, the manuscript is laid out logically, and the data are comprehensive, quantitative, and rigorous. The mutants and their phenotypes will be a valuable resource for Caulobacter researchers.

      Thank you for this positive evaluation. Previous work has mostly focused on the role of PG recycling in the regulation of ampC expression. However, our study and recent work in A. tumefaciens (Gilmore & Cava, 2022) and C. crescentus (Modi et al, 2025) demonstrate that β-lactam resistance is heavily influenced by PG recycling and the metabolic state of the cell, even in the presence of high levels of β-lactamase activity. It is likely that these effects are not limited to the two alphaproteobacterial species investigated to date but may be more widely applicable. Therefore, we believe that our results are relevant beyond the Caulobacter field and may help to stimulate similar analyses in other, medically more relevant species.

      Weaknesses:

      The only major missing piece is the complementation of mutants to demonstrate that loss of the targeted gene is responsible for the observed phenotypes.

      In our initial manuscript, we showed that the replacement of the native AmiR and NagZ genes with mutant alleles encoding catalytically inactive variants of the two proteins gave rise to the same phenotypes as gene deletions. This finding indicates that the defects observed were due to the loss of AmiR or NagZ activity, respectively. To rule out artifacts from polar effects, we have now also conducted the requested complementation analysis for the ΔampG, ΔamiR and ΔnagZ mutants. The results obtained show that deletion mutants carrying an ectopically expressed wild-type gene copy behave essentially like the wild-type strain, thereby verifying the validity of our conclusions (new Figure 4-figure supplement 1).

      Reviewer #2 (Public review):

      Summary:

      Pia Richter et al. investigated the peptidoglycan (PG) recycling metabolism in the alpha-proteobacterium Caulobacter crescentus. The authors first identified a functional recycling pathway in this organism, which is similar to the Pseudomonas route, and they characterized two key enzymes (NagZ, AmiR) of this pathway, showing that AmiR differs in specificity from the AmpD counterpart of E. coli. Further, they studied the effects of deletions within the PG recycling pathway (ampG, amiR, nagZ, sdpA, blaA, nagA1, nagA2, amgK, nagK mutants), showing filamentation and cell widening, thereby revealing a link between PG recycling and cell division. Finally, they provide a link between PG recycling and beta-lactam sensitivity in C. crescentus that is not caused by activation of a beta-lactamase, but rather is a result of reduced supply of PG building blocks increasing the sensitivity of penicillin-binding proteins.

      Strengths:

      This work adds to the understanding of the role of PG recycling in alpha-proteobacteria, which significantly differ in their mode of cell wall growth from the better studied gamma-proteobacteria.

      Thank you for pointing out the relevance of our work. As mentioned above, we believe that our work goes beyond understanding the PG recycling pathway in alphaproteobacteria. Importantly, together with previous work, our results demonstrate a so-far largely neglected critical role of PG recycling in β-lactam resistance that goes beyond the mere regulation of β-lactamase gene expression. It will be interesting to determine the conservation of this phenomenon among other bacteria and to see whether blocking PG recycling could represent a potential strategy to combat β-lactam resistant pathogens.

      Weaknesses:

      The findings are not entirely novel as recent studies by Modi et al. 2025 mBio (studying C. crescentus) and Gilmore & Cava 2022 Nat. Commun. (studying Agrobacterium tumefaciens) came to similar conclusions.

      Gilmore & Cava have made the seminal finding that blocking anhydro-muropeptide import affects cell wall integrity in a manner that is partly independent of its effect on ampC expression. We now extend this finding by investigating various critical steps in the PG recycling pathway of C. crescentus, a species lacking an AmpC homolog. Interestingly, by characterizing a variety of different mutants, we show that the morphological and ampicillin resistance defects they exhibit are not strictly connected and vary substantially between strains, suggesting that different steps in PG recycling differ in their importance for cellular fitness and cell wall integrity. This finding suggests that the phenotypes observed are not simply determined by the efficiency of PG recycling but likely result from a combination of factors. Based on the results obtained, we propose a model that highlights the different factors that may be at play and suggests a mechanism explaining their effects on β-lactam resistance and cell division. Our findings partly overlap with the recent study by Modi et al., but there are various points in which we disagree with their findings and conclusions. The need to rigorously validate our differing results led to a significant delay in the submission of our manuscript.

      Reviewer #1 (Recommendations for the authors):

      Major Comment

      Genetic complementation is lacking for deletion mutants throughout. Could you please provide complemented strains for mutants in key figures where deletion phenotypes are central to the conclusions (e.g., Figure 4 and related supplements).

      As explained above, we have now performed the requested complementation experiments and included the data as Figure 4-figure supplement 1.

      Other minor comments:

      (1) Figure 1

      (a) This is a busy schematic; please consider visually separating PG biosynthesis vs. recycling (e.g., a faint divider line or shaded boxes).

      We have now simplified the schematic and visually separated the PG recycling and de novo biosynthesis pathways.

      (b) Please label "Fructose-6-phosphate" and "Glucosamine-6-phosphate (GlcN-6-P)" on the figure, since they are referenced in the caption (line 1410).

      The symbols for fructose, glucosamine and phosphate are given in the legend on the right. For consistency, we would therefore prefer not to additionally label these compounds in the figure.

      (c) Define all abbreviations in the caption: CM, GTase, TPase; and clarify the legend conventions (e.g., bold vs. regular font; red vs. black text).

      The structure of PG and the different lytic enzymes have now been removed from Figure 1. All remaining abbreviations have now been defined in the legend.

      (2) Figure 2 - Figure Supplement 2

      (a) Panel B: Please include the full chromatogram (it seems to be cropped at 10 min?). For AmiR in particular, it is important to show there are no nearby peaks at earlier retention times (eg GlcNAc).

      The region before 10 min is cropped in many published muropeptide profiles because the peaks contained in it are known to correspond to salts, i.e., borate from the reduction step and phosphate, which are poorly retained on the C18 column (Figure 2–figure supplement 2). As the reviewer stated, free GlcNAc would elute in this region and would not be recognized if it were produced by AmiR. However, AmiR cleaves free anhydro-muropeptides between anhMurNAc and the peptide, and the experiment in Figure 2–figure supplement 2 shows that it does not cleave the bond between MurNAc and peptides in intact peptidoglycan.

      (b) Caption line 1439: with AmiR OR the catalytically...

      Done.

      (3) Figure 3

      Panel A: Label the products as NagZ-treated.

      In this analysis, we quantify specific intermediates from the total cellular pool of PG recycling intermediates. Since the products were not specifically treated with NagZ, we would prefer to keep the figure as it is.

      (4) Figure 4 (and Fig. 4-Figure Supplement 1, 2)

      (a) Please add complemented strains for ΔampG, ΔamiR, and ΔnagZ under the same conditions.

      As described in more detail above, we have now performed the requested complementation analysis.

      (b) Figure 4 - Figure S1 - Please include images of all strains quantified in B (e.g. control WT).

      Done.

      (c) Figure 4 - Figure S2: A. Please include images of all strains quantified in B. Please include spotting dilutions on minimal medium to assess the importance of PG recycling under nutrient limitation, especially given apparent lysis in ΔamiR and ΔampG.

      The length distributions of cells grown in PYE medium are taken from Figure 3 and only shown for comparison (as mentioned in the figure legend). To avoid the duplication of images, we would prefer to keep panel A as it is.

      We have now performed the requested serial-dilution spot assay on minimal (M2G) medium. The results show that ampicillin resistance decreases even more dramatically for all strains in this condition. The new data are presented in Figure 4-figure supplement 3C.

      (d) Figure 4 - Figures S3: A and B. Please include WT control.

      We have now added images of the wild-type strain to panel B of this figure. The serial dilution spot assays shown in panel A were performed on the same plates as those depicted in Figure 4 (as mentioned in the figure legend). To avoid the duplication of images, we would prefer to keep this panel as it is.

      (5) Figure 5

      A, C - please include images of WT control.

      We have now added images of the wild-type strain to panel A of this figure. The serial dilution spot assays shown in panel C were performed on the same plates as those depicted in Figure 4 (as mentioned in the figure legend). To avoid the duplication of images, we would prefer to keep this panel as it is.

      (6) Figure 6:

      (a) A, C - please include images of WT control.

      We have now added images of the wild-type strain to panel A of this figure. The serial dilution spot assays shown in panel C were performed on the same plates as those depicted in Figure 4 (as mentioned in the figure legend). To avoid the duplication of images, we would prefer to keep this panel as it is.

      (b) It would be informative to test ΔamgK and ΔanmK on minimal medium (spotting and/or growth curves) to position these steps within the nutrient-dependent fitness landscape.

      We have now analyzed the ampicillin sensitivity of the ΔamgK, ΔnagK and ΔamgK ΔnagK strains on minimal medium (see Author response image 1). Consistent with the results obtained for other mutants in the PG recycling pathway, growth on minimal (M2G) medium plates leads to increased ampicillin sensitivity of the ΔamgK mutant. By contrast, ΔnagK and, to a lesser extent, ΔamgK ΔnagK cells show an increased tolerance to ampicillin under these conditions compared to growth on PYE plates.

      This phenomenon may be explained by the strong stimulatory effect of GlcNAc-6-P on NagB activity. In the absence of NagK, GlcNAc-6-P levels drop, leading to reduced activation of NagB1/2. This effect, combined with abundant glucose to support central carbon metabolism, may promote GlcN-6-P biosynthesis through GlmS, thereby increasing the flux of metabolites into the de novo PG biosynthesis pathway and thus boosting ampicillin tolerance. However, more research is required to fully understand the molecular basis of this effect. Given that the results are likely to reflect complex interactions between dysregulated enzyme activity and altered metabolite pools caused by increased glucose availability, they provide only limited insight into the role of PG recycling in ampicillin resistance. We therefore propose excluding this experiment from the present manuscript to avoid confusion.

      Author response image 1.

      Serial-dilution spot assay investigating the ampicillin resistance of the indicated mutant strains on minimal (M2G) medium plates.

      (c) Could Figures 6 and 7 be combined for better comparison and since there is no WT control? If so, could you also include the MurNAc cytoplasmic level quantification for the double mutant (Figure 7)?

      We would prefer to keep the two figures separated to avoid creating an overly large figure that contains a total of nine panels. However, we have now included an additional panel in Figure 7 showing the levels of MurNAc in the double mutant.

      (7) Figure 7. A, C

      Please include images of WT control.

      We have now added images of the wild-type strain (now panel B). The serial dilution spot assays (now panel D) were performed on the same plates as those depicted in Figure 4 (as mentioned in the figure legend). To avoid the duplication of images, we would prefer to keep this panel as it is.

      (8) Figure 8-S1D, F

      Please include images of WT control.

      Panel F of this figure already contains a wild-type control.

      (9) Figure 10 A, C

      Please include images of WT control and ∆amiR (A).

      Done.

      (10) Figure 11

      Consider adding or highlighting in this figure (in a simplified manner) the major PG recycling differences in Caulobacter? The current model doesn't really show any difference that is unknown.

      This figure presents a model of the mechanism underlying the increased β-lactam sensitivity of PG recycling-deficient cells. Since the PG recycling pathway of C. crescentus is already presented in detail in Figure 1, we would like to keep this figure simple and thus leave it as it is.

      (11) Comments by lines:

      (a) Line 192: Clarify that NagZ is also part of the rate-limiting step since there is no difference between AmiR or NagZ order of hydrolysis?

      We have now omitted the statement that AmiR catalyzes the rate-limiting step in the PG recycling process, because our data do not allow definitive conclusions on this point.

      (b) Line 201: Please define "considerable fraction", since this is known, and cite the original reference(s).

      Done.

      (c) Line 203: Please also cite the primary papers where they have found that disruption of the PG recycling pathway in E. coli and P. aeruginosa doesn't result in morphological defects.

      Since there are a number of papers that report PG recycling-deficient mutants of E. coli and P. aeruginosa, we would like to keep citing reviews to support this statement. However, we have now additionally included a review by Park & Uehara (2008), which provides a detailed overview of PG recycling in bacteria.

      (d) Line 220-223: Though there are no obvious morphological defects, several mutants (e.g., ΔamiR, ΔampG) appear to be lysing or stressed under minimal conditions. Could you include spotting assays and/or growth curves on minimal medium (Figure 4, Figure S2) to quantify fitness under nutrient limitation?

      We have performed the requested serial dilution spot assays on minimal (M2G) medium plates and now present the data obtained in Figure 4-figure supplement 3C.

      (e) Line 224: PG recycling has been found to contribute to the regulation of B-lactam resistance in several organisms, not just those two. Perhaps add "including C. freundii and P. aeruginosa"

      Done.

      (12) Typographical errors:

      (a) Line 284: "caron" should be carbon.

      Done.

      (b) Line 323: "Figure C" needs a figure number.

      Done.

      (c) Line 33: "regulaton" should be regulation.

      Done.

      Reviewer #2 (Recommendations for the authors):

      (1) The study is well conducted and describes a number of experiments that significantly deepen previous findings. The conclusions of this paper are mostly well supported by data, but some experiments and data analysis may need to be clarified and extended.

      Thank you for this positive evaluation.

      (2) The data presented in Figures 2B and 2C show activities of AmiR and NagZ using LTase-cleaved cell wall preparations. Unfortunately, the preparations tested with the two enzymes should be identical, but apparently are not. Why aren't identical preparations used?

      We are sorry for the confusion. As stated in the Methods section (page 28, lines 757 and 773), the AmiR activity assays used LT products from PG sacculi isolated from E. coli D456, whereas the NagZ activity assays used LT products from PG sacculi isolated from E. coli CS703-1. Both strains have a higher pentapeptide content than wild-type E. coli. D456 lacks PBPs 4, 5 and 6 and has a moderate level of pentapeptides. CS703-1 lacks PBPs 1a, 4, 5, 6, 7 as well as AmpC and AmpH, and is known to have a higher pentapeptide content than D456. These differences are the reason for the distinct muropeptide profiles in panels B and C of Figure 2.

      (3) I am missing a control experiment where muropeptides treated with NagZ were further digested with AmiR? This would show whether AmiR is able or not to cleave MurNAc-peptides. This is not evident from the provided experiments.

      We have now tested the activity of AmiR towards anhMurNAc-tetrapeptide in vitro. The results show that AmiR efficiently cleaves this GlcNAc-free anhydro-muropeptide species, verifying that it can also act on turnover products that have been previously processed by NagZ. The new data are shown in Figure 2–figure supplement 5.

      (4) The claim that PG recycling is critical, particularly upon transition to the stationary phase and under nutrient limitation, is not justified. It conflicts with the obvious morphological effects also in the exponential phase and with the absence of morphological defects in minimal medium: pronounced defects in rich PYE medium (Figure 4A/B) disappear in minimal M2G medium (Figure 4_figure supplement 2). It seems that catabolite repression effects apply here. Is the morphological effect in rich PYE medium reversed by adding glucose?

      We agree that PG recycling is not considerably more important in stationary phase and have removed this statement. Interestingly, while PG recycling-deficient mutants show no obvious morphological defects in minimal (M2G) medium, their ampicillin sensitivity even increases under this condition (new Figure 4-figure supplement 3C), confirming that morphological and resistance defects are not strictly coupled. Preliminary data indicate that the morphological defects of the mutant cells are also abolished upon growth in PYE+glucose medium. High glucose availability may promote increased de novo synthesis of PG precursors, thereby partially restoring the PG precursor pool. We propose that the morphological and resistance phenotypes develop at different degrees of PG precursor depletion. However, future research is required to clarify the precise molecular basis of this phenomenon.

      (5) Figure 4: Why is the contribution of AmpG to ampicillin resistance much lower than for amiR or nagZ, despite ampG mutants showing the largest morphological defects? Does the accumulation of UDP-MurNAc or UDP-MurNAc-peptide correlate with ampicillin resistance, whereas the morphological effects correlate with the lack of precursors?

      The exact reason why the ΔampG mutant shows such a strong discrepancy in the severity of its morphological and resistance defects compared to the ΔamiR and ΔnagZ mutants remains unclear, because all of these deletions completely block the recycling of anhydro-muropeptides. The major difference in the ΔampG mutant is its inability to import anhydro-muropeptides, causing their accumulation in the periplasm. We propose that periplasmic anhydro-muropeptides, in particular the pentapeptide-containing species, can interact with the substrate-binding sites of PG metabolic enzymes, thereby interfering with proper PG biosynthesis. Conversely, by interacting with transpeptidases, they may reduce their accessibility to ampicillin and thus preserve their activity under β-lactam stress, particularly under conditions in which low PG precursor availability reduces binding site occupancy and thus facilitates antibiotic association.

    1. eLife Assessment

      This important study introduces an advance in multi-animal tracking by reframing identity assignment as a self-supervised contrastive representation learning problem. It eliminates the need for segments of video where all animals are simultaneously visible and individually identifiable, and significantly improves tracking speed, accuracy, and robustness with respect to occlusion. This innovation, which is supported through compelling evidence, has implications beyond animal tracking, potentially connecting with advances in behavioral analysis and computer vision.

    2. Reviewer #3 (Public review):

      Summary:

      The authors propose a new version of idTracker.ai for animal tracking. Specifically, they apply contrastive learning to embed cropped images of animals into a feature space where clusters correspond to individual animal identities. By doing this, they address the requirement for so-called global fragments - segments of the video, in which all entities are visible/detected at the same time. In general, the new method reduces the long tracking times from the previous versions, while also increasing the average accuracy of assigning the identity labels.

      Comments on revisions:

      I have no additional comments; the authors have responded to all the points I raised previously.

    3. Author response:

      The following is the authors’ response to the previous reviews

      eLife Assessment

      This important study introduces an advance in multi-animal tracking by reframing identity assignment as a self-supervised contrastive representation learning problem. It eliminates the need for segments of video where all animals are simultaneously visible and individually identifiable, and significantly improves tracking speed, accuracy, and robustness with respect to occlusion. This innovation has implications beyond animal tracking, potentially connecting with advances in behavioral analysis and computer vision. The strength of support for these advances is compelling overall, although there were some remaining minor methodological concerns.

      To address the “minor methodological concerns” mentioned in the editorial assessment and by Reviewer 3, the new version of the manuscript includes the following changes:

      a) The new ms no longer uses the word “accuracy” and instead refers to “IDF1 scores”. See, for example, lines 46, 161, 176, and 522 for the new wording.

      b) Instead of comparing software packages by their mean accuracy over the benchmark, Reviewer 3 proposed using medians or even boxplots. We now provide boxplot results with mean, median, percentiles and outliers (Figure 1-figure supplement 2).

      Additionally, we also include in the text the other recommendations from Reviewer 3:

      a) We now describe more explicitly the problems of the original idtracker.ai v4 in the benchmark (lines 66-68). Around half of the videos had a high accuracy in the original idtracker.ai (v4), but the other half had lower accuracies (Figure 1a, blue). The new idtracker.ai has high accuracy values for all the videos (Figure 1a, magenta).

      Also, the videos with high accuracy in the old idtracker.ai had very long tracking times (Figure 1b, blue), whereas the new version does not (Figure 1b, magenta). The benchmark thus shows that the new idtracker.ai has higher accuracy for all videos and lower tracking times, making it a much more practical system than previous ones.

      b) We further clarified the occlusion experiment (lines 188-190 and 277-290).

      c) We explain why we measure accuracies with and without animal crossings (lines 49-62).

      d) We added a Discussion section (lines 223-244).

      We believe the new version has clarified the minor methodological concerns.

      Reviewer #3 (Public review):

      The authors have reorganized and rewritten a substantial portion of their manuscript, which has improved the overall clarity and structure to some extent. In particular, omitting the different protocols enhanced readability. However, all technical details are now in the appendix, which is now referred to more frequently in the manuscript, as was already the case in the initial submission. These frequent references to the appendix - and even to appendices from previous versions - make it difficult to read and fully understand the method and the evaluations in detail. A more self-contained description of the method within the main text would be highly appreciated.

      In the new ms, we have reduced the references to the appendix by having a more detailed explanation in one place, lines 49-62.

      Furthermore, the authors state that they changed their evaluation metric from accuracy to IDF1. However, throughout the manuscript they continue to refer to "accuracy" when evaluating and comparing results. It is unclear which accuracy metric was used or whether the authors are confusing the two metrics. This point needs clarification, as IDF1 is not an "accuracy" measure but rather an F1-score over identity assignments.

      We thank the reviewer for noticing this. Following this recommendation, we replaced “accuracy” with “IDF1 score” throughout the ms. See, for example, lines 46, 161, 176, and 522.
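
      To make the reviewer's distinction concrete, the following minimal sketch contrasts a plain per-detection accuracy with an F1-score over identity assignments, IDF1 = 2·IDTP / (2·IDTP + IDFP + IDFN). It assumes that predicted and ground-truth identity labels are already matched to each other (a full IDF1 computation first solves that matching), and the function names and toy labels are hypothetical, not taken from idtracker.ai.

```python
def idf1_score(gt_ids, pred_ids):
    """IDF1 = 2*IDTP / (2*IDTP + IDFP + IDFN), assuming labels are already matched.

    pred_ids may contain None where the tracker assigned no identity;
    gt_ids is assumed to contain no None values.
    """
    if len(gt_ids) != len(pred_ids):
        raise ValueError("gt_ids and pred_ids must have the same length")
    pairs = list(zip(gt_ids, pred_ids))
    idtp = sum(1 for g, p in pairs if p == g)
    # A wrong identity counts as both a false positive and a false negative;
    # a missing identity (None) counts only as a false negative.
    idfp = sum(1 for g, p in pairs if p is not None and p != g)
    idfn = sum(1 for g, p in pairs if p != g)
    denom = 2 * idtp + idfp + idfn
    return 2 * idtp / denom if denom else 0.0


def plain_accuracy(gt_ids, pred_ids):
    """Fraction of detections whose predicted identity equals the ground truth."""
    return sum(1 for g, p in zip(gt_ids, pred_ids) if p == g) / len(gt_ids)


# Toy labels (illustrative only): one unassigned detection and one identity swap.
gt = [1, 2, 1, 2, 1, 2]
pred = [1, 2, 1, None, 2, 2]
print("IDF1:", idf1_score(gt, pred))
print("accuracy:", plain_accuracy(gt, pred))
```

      On these toy labels the two numbers differ, because IDF1 penalizes a wrong identity twice and an unassigned detection once.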

      The authors compare the speedups of the new version with those of the previous ones by taking the average. However, it appears that there are striking outliers in the tracking performance data (see Supplementary Table 1-4). Therefore, using the average may not be the most appropriate way to compare. The authors should consider using the median or providing more detailed statistics (e.g., boxplots) to better illustrate the distributions.

      We thank the reviewer for asking for more detailed statistics. We added the requested box plot in Figure 1-figure supplement 2 to provide more complete statistics in the comparison.
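
      The reviewer's concern about averaging in the presence of outliers can be illustrated with a small sketch. The tracking times below are made-up values, not data from the benchmark, and `summarize` is a hypothetical helper built only on Python's standard `statistics` module.

```python
import statistics


def summarize(values):
    """Return the mean, the median and the quartiles that a boxplot would display."""
    q1, q2, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "q1": q1,
        "q3": q3,
    }


# Hypothetical tracking times in hours: four fast videos and one extreme outlier.
times = [0.5, 0.6, 0.7, 0.8, 48.0]
summary = summarize(times)
print(summary)
```

      With a single extreme value, the mean is pulled far above the typical video while the median stays at the central value, which is why a boxplot conveys the distribution more faithfully than a mean alone.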

      The authors did not provide any conclusion or discussion section. Including a concise conclusion that summarizes the main findings and their implications would help to convey the message of the manuscript.

      We added a Discussion section in lines 223-244.

      The authors report an improvement in the mean accuracy across all benchmarks from 99.49% to 99.82% (with crossings). While this represents a slight improvement, the datasets used for benchmarking seem relatively simple and already largely "solved". Therefore, the impact of this work on the field may be limited. It would be more informative to evaluate the method on more challenging datasets that include frequent occlusions, crossings, or animals with similar appearances.

      Around half of the videos also had a very high accuracy in the original idtracker.ai (v4), but the other half had lower accuracies (Figure 1a, blue). For example, we found IDF1 scores of 94.47% for a video of 100 zebrafish with thousands of crossings (z_100_1), 93.77% for a video of 4 mice (m_4_2) and 69.66% for a video of 100 flies (d_100_3). The new idtracker.ai has high accuracy values for all the videos (Figure 1a, magenta).

      Importantly, the tracking times for the majority of videos were very long in the original idtracker.ai (Figure 1b, blue), limiting the use of the tracking system in practice. The new system achieves both a high accuracy in all videos (Figure 1a, magenta) and much lower tracking times (Figure 1b, magenta), making it a much more practical system.

      We have added a sentence on the limitations of the original idtracker.ai as obtained from the benchmark (lines 66-68).

      The accuracy reported in the main text is "without crossings" - this seems like an incomplete evaluation, especially since tracking objects that do not cross seems a straightforward task. Information is missing on why crossings are a problem and are dealt with separately.

      We have now added an explanation of why we measure accuracy without crossings and why we separate it from the accuracy for the whole trajectory (lines 49-62). The reason is that the identification algorithm presented in this ms only identifies animal images outside the crossings. This algorithm makes robust animal identifications throughout the video despite the thousands of animal crossings typically present in each of the videos used in the benchmark. A second algorithm, which has not changed since the first idTracker in 2014, assigns animal positions during crossings once the first algorithm has identified the animals before and after the crossings.

      There are several videos with a much lower tracking accuracy; explaining what the challenges of these videos are and why the method fails in such cases would help to understand the method's usability and weak points.

      Some videos had low accuracy on previous versions (Figure 1a, blue), but the new idtracker.ai has high accuracy in all of them (Figure 1a, magenta).

      Reviewer #3 (Recommendations for the authors):

      (1) As described before, the authors claim to use IDF1 as their metric in the whole manuscript (lines 414-436) but only refer to accuracy when presenting the results. It is not clear, whether accuracy was used as a metric instead of IDF1 or the authors are confusing these metrics.

      Following this recommendation, we replaced “accuracy” with “IDF1 score”; see lines 46, 161, 176, and 522.

      (2) In the introduction, a brief explanation why crossings need to be dealt with separately would help to understand the logic of the method design.

      We added such an explanation in lines 49-62.

      (3) Figure 3: We asked how the tracking accuracy is assessed with occlusions. The authors responded that only the GT points inside the ROI are taken into account when computing the accuracy. Does this mean that the occluded blobs are still part of the CNN training and the clustering? This calls the purpose of this experiment into question, since the accuracy would then change only if the errors that their approach makes either way fall outside the ROI and are therefore not part of the metric evaluation.

      The occluded blobs are not part of any training because they are erased from the video; they simply do not exist. We made this clearer in lines 188-190 and 277-290.

      (4) Figure 1: The fact that datasets are connected with a line is misleading - there is no connection between the data along the x-axis. A line plot is not an appropriate way to present these results.

      The new ms clarifies that the lines are for ease of visualization; see the last line in the caption of Figure 1.

      (5) Lines 38-39: It is not clear how the CNN can be pretrained for the entire video if there are no global segments or only short ones. Here, the distinction between "no segments", "only short segments" and "pretraining on the entire video" is not explained.

      This pretraining protocol is not used in the version of the software we present, so details of this are not as relevant.

      (6) Figure 2a: The authors are showing "individual fragments" and "individual fragments in a global fragment." However, it seems there are a few blue borders missing. In the text (l. 73-79), they note that they are displaying them as "examples", but the absence of correct blue borders is confusing.

      In the new ms, we have replaced the label “Individual fragments in a global fragment” with “Individual fragments in an example global fragment” in the legend of Figure 2.

      (7) Lines 61-63, 148-151, and 162-164: Could the authors clarify why they used the average instead of median when comparing the speedups of the new version and the old ones?

      We thank the reviewer for asking for more detailed statistics. We added the requested box plot in Figure 1-figure supplement 2 to provide more complete statistics in the comparison of accuracies and tracking times for old and new systems.

      (8) Lines 140-144: The post-processing steps are not clear. The authors should rather state clearly which processes of the old versions they are using. Then the authors could shortly explain them.

      We removed this paragraph and explained in more detail in lines 49-62 which parts of the software are new and which ones are not.

      (9) Lines 239-251: Here, the authors are clarifying on a section 1-2 pages before. This information should be directly in that section instead.

      Following this recommendation, we clarified the occlusion experiment in the main text (lines 188-191) to make it more self-contained. Still, the flow of the main text is better with some details in Methods.

      (10) Line 38: It is not clear how the CNN can be pretrained for the entire video if there are no global segments or only short ones. Here, the distinction between "no segments", "only short segments" and "pretraining on the entire video" is a bit misleading/underexplained.

      See number 5.

      (11) Figure 2a: The authors are showing "individual fragments" and "individual fragments in a global fragment." However, it seems there are a few blue borders missing. In the text (l. 73-79), they note that they are displaying them as "examples", but the absence of correct blue borders is confusing.

      See number 6.

      (12) Figure 2c and lines 115-118: "Batches" itself is not meaningful without any information on the batch size. The authors should rather report the batch size and then the number of epochs. Figure 2 states that each batch contains 400 positive and 400 negative pairs of images. However, there is no information about the total number of images.

      Furthermore, these metrics are inappropriate here, since training is carried out from scratch (or from an already pre-trained model) for every new video, and each video has a different number of animals and a different number of images.

      Following this recommendation, we clarified the number of images in each batch (Figure 1c caption and lines 134-138), why we do not work with epochs (lines 700-702), and the idea that the clusters in Figure 2 represent an example and the number of batches needed for the clusters to form depends on the video details.
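
      As an illustration of what a batch of positive and negative image pairs contributes during contrastive training, the sketch below evaluates a generic margin-based pairwise contrastive loss on toy embeddings. The loss form, the margin value and all names are assumptions chosen for illustration; the manuscript's actual loss and sampling scheme may differ. Only the batch composition (400 positive and 400 negative pairs) is taken from the text above.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_batch_loss(pos_pairs, neg_pairs, margin=1.0):
    """Generic margin-based contrastive loss averaged over all pairs in a batch.

    Positive pairs (assumed same animal) are pulled together; negative pairs
    (assumed different animals) are pushed apart until they are at least
    `margin` away from each other.
    """
    pos_term = sum(euclidean(a, b) ** 2 for a, b in pos_pairs)
    neg_term = sum(max(0.0, margin - euclidean(a, b)) ** 2 for a, b in neg_pairs)
    return (pos_term + neg_term) / (len(pos_pairs) + len(neg_pairs))


# Toy 2-D embeddings, 400 positive and 400 negative pairs per batch.
pos_pairs = [((0.0, 0.0), (0.0, 0.0))] * 400

# Well-separated identities: negatives are 5 units apart, beyond the margin.
separated_neg = [((0.0, 0.0), (3.0, 4.0))] * 400
# Collapsed embedding: every image maps to the same point.
collapsed_neg = [((0.0, 0.0), (0.0, 0.0))] * 400

print(contrastive_batch_loss(pos_pairs, separated_neg))
print(contrastive_batch_loss(pos_pairs, collapsed_neg))
```

      In this toy setting the loss vanishes once identities form well-separated clusters and stays positive when all images collapse to one point, which is the behavior that lets clusters emerge over successive batches.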

      Appendix 1-figure 1: Why do the methods fail? It looks as though the method is fairly unreliable for certain videos. What is the reason for the methods to crash, and how can this be avoided?

      Those failures are only for the old idtracker.ai and Trex, not for the method presented here. Our new contrastive algorithm does not fail in any of the videos in the benchmark.

      We thank the reviewer for the detailed suggestions. We believe we have incorporated all of them in the new version of the ms.

    1. eLife Assessment

      Karimian et al. present a valuable new model to explain how gamma-band synchrony (30-80 Hz) can support human visual feature binding by selectively grouping image elements, countering recent criticisms that the stimulus dependence of gamma oscillations limits their functional role. Grounded in the theory of weakly coupled oscillators, the model captures behavioural patterns observed in human psychophysics, offering support for the potential role of synchrony-based mechanisms in feature-binding. The development of the model in alignment with primate electrophysiology convincingly supports the paper's claims that gamma synchrony may be the underlying mechanism. While the paper does not present electrophysiological results that directly link gamma oscillations to figure-ground segregation in the presented task, the model makes several predictions that can be tested experimentally.