    1. eLife Assessment

      This important study provides a comprehensive map of how touch-sensitive neurons in the fly head connect to downstream circuits, revealing parallel pathways that preserve spatial organization and identifying a developmentally defined circuit linking sensory input to grooming behavior. The evidence is convincing, with detailed anatomical reconstruction and quantitative analysis supporting the main claims, while the link to behaviour remains based on prior functional work. The study will be of interest to neuroscientists studying sensory processing and motor control, and provides an invaluable resource for future functional investigations.

    2. Joint Public Review:

      Summary:

      Calle-Schuler et al. reconstruct all the pre- and postsynaptic partners of the bristle mechanosensory neurons (BMNs) on the adult fly head to determine whether the neural circuits support the parallel mechanosensory pathways that could be instrumental in shaping the sequential motor patterns of fly grooming. They find that most presynaptic neurons, interneurons, and excitatory postsynaptic neurons are also somatotopically organized, such that each neuron is more strongly connected to bristle mechanosensory neurons that are closer on the head and less connected to those that are further away. These include the direct BMN-BMN circuits, excitatory interneurons, and the inhibitory networks. They also find that the entire hemilineage 23b forms an excitatory postsynaptic circuit with BMNs, highlighting how these circuits, and hence their function, could be developmentally determined.

      Strengths:

      This is a complete map of all the neurons that make 5 or more pre- and postsynaptic connections with the fly head BMNs. Using this map, the authors have identified various trends, such as ascending neurons providing most of the GABAergic inhibitory input, which could supply the presynaptic inhibition essential for the parallel model of sequential grooming generation. Moreover, they identified that the entire cholinergic hemilineage 23b is postsynaptic to BMNs. Both the excitatory postsynaptic connectivity and the inhibitory presynaptic connectivity demonstrate core motifs of the parallel circuits necessary for the hierarchical suppression model of the grooming sequence.
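
      The inclusion criterion mentioned above can be illustrated with a minimal sketch. It assumes the threshold of 5 synapses applies to each neuron-to-neuron pair; the manuscript's exact criterion may differ. The neuron names, the counts, and the function `filter_edges` are hypothetical and are not taken from the dataset.

```python
MIN_SYNAPSES = 5  # threshold quoted in the review; assumed here to apply per neuron pair


def filter_edges(edges, min_synapses=MIN_SYNAPSES):
    """Keep (pre, post, count) edges whose synapse count meets the threshold."""
    return [(pre, post, n) for pre, post, n in edges if n >= min_synapses]


# Hypothetical toy edges, not real connectome data.
toy_edges = [
    ("bmn_1", "partner_x", 12),  # BMN output, kept
    ("bmn_1", "partner_y", 3),   # below threshold, dropped
    ("partner_z", "bmn_2", 5),   # input onto a BMN, kept (exactly at threshold)
]

kept = filter_edges(toy_edges)
```

      A threshold of this kind trades completeness for reliability: weak contacts are excluded, so the resulting map covers partners above the cutoff rather than every contact.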

      Weaknesses:

      Somatotopic organization with hierarchical suppression is an elegant mechanism for generating sequential motor patterns during grooming. Yet anatomical connectivity alone, in the absence of functional connectivity, cannot explain the grooming motor sequences. Future work should aim to map functional connectivity onto the behavioral sequence.

      Closing statement:

      The authors have addressed the major concerns regarding clarity, scope, and interpretation. The manuscript is now significantly improved and is clearly framed as an anatomical resource that identifies circuit motifs consistent with existing models of grooming control.

    3. Author response:

      The following is the authors’ response to the original reviews.

      We sincerely thank the Reviewers for their careful reading and insightful critiques, which have helped make the manuscript clearer and more impactful.

      In response to the Reviewers, we substantially revised the manuscript to improve clarity, framing, and accessibility for readers outside the Drosophila connectomics community, while keeping the core conclusions unchanged. We clarified the study’s scope (defining parallel circuit architecture rather than testing sufficiency for reconstructing grooming sequence order), restructured the last Introduction paragraph, several Results sections, and the Discussion to foreground the main findings and their relevance to the parallel hierarchical-suppression model. We also added key methodological clarifications for non-specialist readers, including how BMN classes were identified in FAFB by a correlative approach (with type-level, not single-bristle, resolution), how FlyWire/Codex synapse counts are defined (contacts vs T-bars), how sensory BMNs can have postsynaptic sites, and what is meant by ascending vs descending neurons in a brain-only dataset. Across the Results, we improved terminology and definitions (e.g., projection zones, hemilineage 23b, BMN nomenclature such as BM-InOm), clarified what derives from prior work (Eichler et al., 2024) versus new analyses, strengthened interpretation of BMN→motor connections as likely modulatory, and expanded explanation of postsynaptic partner categories. We also revised figures and legends to better highlight overlap/segregation and somatotopy, moved the cosine-similarity matrices into the main figures (new Figure 9), added a new graphical summary figure (new Figure 15), and explicitly acknowledged key limitations, including one-hemisphere analysis and lack of VNC coverage in FAFB.

      In addition, in response to the suggestion of a rank-order test relating BMN→second-order wiring to the grooming hierarchy, we clarified throughout the revised manuscript that this study does not aim to test whether connectivity alone is sufficient to reconstruct grooming sequence order, and we removed wording that could imply such a claim. As detailed in our response to that specific critique below, sequence sufficiency is outside the scope of this study, and a simple linear ordering based on aggregate synapse weights is not straightforward to interpret in this system (e.g., BM-Taste vs. BM-InOm output strength does not track grooming order, BMNs likely contribute to multiple behaviors, and head grooming order is not resolved at sufficient granularity). We therefore respectfully request that the sentence in the eLife Assessment suggesting that the paper is weakened by not including this analysis be removed. As currently written, it frames an out-of-scope analysis as a missing test of the manuscript’s main claims and may mislead readers about the paper’s intended contribution: a synaptic-resolution anatomical definition of parallel BMN circuit architecture and motifs consistent with hierarchical suppression.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      Calle-Schuler et al. reconstruct all the pre- and postsynaptic partners of the bristle mechanosensory neurons (BMNs) on the adult fly head to understand how neural circuits determine the sequential motor patterns during fly grooming. They find that most presynaptic neurons, interneurons, and excitatory postsynaptic neurons are also somatotopically organized, such that each neuron is more strongly connected to bristle mechanosensory neurons that are closer on the head and less connected to those that are further away. These include the direct BMN-BMN circuits, excitatory interneurons, and the inhibitory networks. They also find that the entire hemilineage 23b forms excitatory postsynaptic circuits with BMNs, highlighting how these circuits, and hence their function, could be developmentally determined.

      Strengths:

      This is a complete map of all the neurons that make 5 or more pre- and post-synaptic connections of the fly head BMNs. Using this, the authors have identified various trends, such as ascending neurons providing most of the GABAergic inhibitory input, which could provide the presynaptic inhibition essential for the parallel model for sequential grooming generation. Moreover, they identified that the entire cholinergic hemilineage 23b is postsynaptic to BMNs.

      Weaknesses:

      Although somatotopic organization is an elegant mechanism for generating sequential motor patterns during grooming, none of the analyses in the paper directly demonstrate that this somatotopic connectivity is sufficient to generate hierarchical suppression and reconstruct the grooming sequence. If somatotopic organization is sufficient, then hierarchical clustering should recover the grooming sequence. The detailed connectome enables the authors to test whether some networks are more crucial for the grooming sequence than others: to what extent can each network individually (e.g., ascending neurons-BMN alone) or in combination (BMN-BMN, ascending-BMN, BMN-descending, etc.) recover the sequence observed during grooming? If all the pre- and postsynaptic neurons put together cannot explain the sequence, then the sequence is probably determined by individual synaptic strengths or by other key downstream neurons.

      We appreciate the Reviewer’s interest in how BMN connectivity relates to the grooming sequence, and agree that understanding how mechanosensory circuits contribute to hierarchical action selection is an important direction. In this study, however, our goal was not to test whether connectivity alone is sufficient to reconstruct the full grooming sequence. Rather, we focused on defining the parallel circuit architecture underlying individual grooming movements and on identifying anatomical features—most notably extensive presynaptic inhibition—that are consistent with previously proposed models of hierarchical suppression.

      We recognize that aspects of the Introduction and the references cited there to prior work on the grooming sequence may have led some readers to expect a direct sequence-prediction analysis. To address this, we revised the Introduction and Results to clarify the scope of the study and adjusted language to avoid implying that we aimed to derive the grooming order from connectivity. Consistent with this framing, the Abstract mentions the sequence only in the context of presynaptic inhibition, which provides anatomical support for existing models of hierarchical suppression. We therefore do not draw conclusions about the ordering of grooming movements from the connectome itself. Details of the specific manuscript revisions are provided below in the Recommendations for authors section.

      The Reviewer suggests testing whether somatotopic organization is sufficient to recover the grooming sequence by clustering BMN connectivity or by examining whether specific subnetworks (e.g., BMN → ascending, BMN → descending, or BMN→BMN pathways) reproduce the sequence. We carefully considered these possibilities. However, several factors currently limit the interpretability of such analyses.

      First, synaptic weight alone does not align with known features of the grooming sequence. For example, BM-Taste neurons contribute the majority of BMN synaptic output, yet proboscis grooming is not the first head grooming movement, whereas BM-InOm neurons contribute less than 9% of total output despite eye grooming occurring first. As we now clarify in the Results, global synapse number therefore does not predict the order of grooming movements.

      Second, BMNs likely distribute signals across multiple behavioral pathways beyond grooming, including circuits involved in feeding and escape behaviors. Because the connectome aggregates all postsynaptic targets, analyses based solely on connectivity strength cannot isolate the subset of circuits specifically responsible for grooming-related action selection.

      Third, the head grooming sequence itself has not been resolved at the spatial granularity required for such analyses across head regions. While eye grooming is well characterized as the first head movement, the relative ordering among antennae, proboscis, and other head bristle regions remains less clearly defined, making it difficult to evaluate correspondence between connectivity-derived rankings and behavioral order.

      Because of these limitations, we concluded that clustering or network-based analyses aimed at reconstructing the grooming sequence from connectivity alone would be difficult to interpret and therefore chose not to include them. Accordingly, we have deliberately avoided claiming that the connectome is sufficient to generate the grooming sequence. Instead, we interpret the somatotopic architecture and inhibitory circuitry described here as anatomical features consistent with previously proposed models of hierarchical suppression, while leaving the question of sufficiency for future studies that integrate connectomics with functional and behavioral analyses.

      Given that we do not claim sufficiency of the connectome for producing the grooming sequence, we respectfully request that the eLife Assessment avoid framing the manuscript around this expectation, as wording that implies the manuscript should reconstruct the sequence from connectivity could misrepresent the intended scope of the study and potentially mislead readers about its primary contributions.

      Reviewer #2 (Public review):

      Summary:

      Calle-Schuler et al. present an extensive analysis of the synaptic connectivity of mechanosensory head bristles in the brain of Drosophila melanogaster. Building on the previously described set of bristle afferent neurons (BMNs) located on the head, the study aims to provide a complete, quantitative assessment of all synaptic partners in the ventral brain. Activation of head bristles induces grooming behavior, which is hierarchically organized and hypothesized to be grounded in a parallel cellular architecture in the central brain. The authors found evidence that, at the synaptic level, neurons downstream of the BMN afferents, namely the postsynaptic LB23 interneurons and recurrent GABAergic neurons (involved in sensory gain control), are organized in parallel, following the somatotopic organization described for the BMN afferents. This study therefore represents an important step towards a better understanding of the cellular circuits that govern the hierarchical order of sequentially organized grooming behavior in Drosophila melanogaster.

      The study is well done, and the images are well designed and extensive in number, but the account is challenging to read and digest for readers outside the Drosophila/connectome community. It is amazing what can be done with the connectome nowadays using the up-to-date FAFB dataset and the analytical and visual tools (as in FlyWire), in combination with known anatomy, physiology, and behavior in D. melanogaster. I suggest that the authors provide more detail on hemilineages, their relationship to the FAFB connectome, the predicted neurotransmitter identity, and the statistical CATMAID tools used in some of the figures.

      A graphical summary at the end of the study would be very useful to highlight the important findings focusing on neuron populations identified in this study and their position in the hypothesized parallel central circuitry of BMNs.

      We thank the Reviewer for the thoughtful and constructive comments. In response, we substantially revised the manuscript to improve clarity and accessibility, particularly for readers outside the Drosophila connectomics community. We rewrote portions of the Introduction, Results, and Discussion to better foreground the main findings, reduce density, and more clearly distinguish prior work from the new analyses presented here. We also added methodological clarification throughout, including how BMN classes were identified in the FAFB dataset using a correlative, type-level approach, how FlyWire/Codex synapse counts are defined, and clarified terminology related to projection zones, pre- versus postsynaptic structure, and partner classes. To address the Reviewer’s request for more developmental context, we added a more explicit definition of hemilineages at first mention in the Abstract and Results. In addition, we revised figures and legends to make the somatotopic and parallel organization of the circuitry easier to interpret, including moving the cosine-similarity matrices into the main figures. Finally, in direct response to the Reviewer’s suggestion for a higher-level synthesis, we added a new graphical summary figure (Figure 15) at the end of the manuscript to highlight the principal neuron populations identified in the study and their proposed positions within the parallel central BMN circuitry. Together, we believe these revisions have made the manuscript clearer, more accessible, and better framed for a broad readership while preserving its core conclusions. Details of these changes are provided in the Recommendations for the authors section.

      Reviewer #3 (Public review):

      Summary:

      The authors set out to extend their previous mapping of Drosophila head mechanosensory neurons (Eichler et al., 2024) by reconstructing their full second-order connectome. Their aim is to reveal how bristle mechanosensory neurons (BMNs) interface with excitatory and inhibitory partners to generate location-specific grooming movements, and to identify the circuit motifs and developmental lineages that support this transformation.

      Strengths:

      The strengths of this work are clear. The authors present a comprehensive synaptic-resolution connectome for BMNs, identifying nearly all of their pre- and postsynaptic partners. This dataset reveals important circuit motifs:

      (1) BMNs provide feedforward excitation to descending neurons, feedforward inhibition to interneurons, and are themselves strongly regulated by GABAergic presynaptic inhibition.

      (2) These motifs together support the idea that BMN activity is locally gated and hierarchically suppressed, fitting well with known behavioural sequences of grooming.

      (3) The study also shows that connectivity preserves somatotopy, such that BMNs from neighbouring bristle populations converge onto shared partners, while distant BMNs remain segregated.

      (4) A developmental analysis reveals both primary and secondary partners, suggesting a layered scaffold plus adult-specific elaborations.

      (5) Finally, the identification of hemilineage 23b (LB23) as a core postsynaptic pathway - incorporating previously described antennal grooming neurons (aBN2) - provides a striking link between developmental lineage, anatomical connectivity, and behavioral output.

      (6) Together, the dataset represents a valuable resource for the neuroscience community and a foundation for future functional studies.

      Weaknesses:

      There are also some weaknesses that mostly only limit clarity.

      (1) The writing is dense, with results often presented in a cryptic fashion and the functional implications deferred to the discussion. As a result, the significance of circuit motifs such as BMN→motor or reciprocal inhibitory loops is sometimes buried, rather than highlighted when first described.

      We thank the Reviewer for this helpful suggestion. In response, we revised several sections of the Results to improve clarity and more clearly highlight the functional significance of key circuit motifs when they are first introduced. Specifically, we streamlined dense passages and added brief explanatory statements linking motifs such as reciprocal inhibitory loops to their potential roles in the proposed parallel circuit architecture. Additional details of these revisions are provided in the Recommendations for the authors section below.

      (2) Some assumptions require more explanation for non-specialist readers - for example, how bristle identity is inferred in EM in the absence of cuticular structures, or what is meant by "ascending" and "descending" in a dataset that does not include the ventral nerve cord. While some of this comes from the earlier paper, it would help readers of this one to explain this.

      In response, we added clarifying text describing how BMN types were identified in the FAFB dataset using a correlative approach based on stereotyped projection morphologies and prior light-level anatomical data, and we explicitly state the limits of this type-level assignment in the absence of cuticular bristles in the EM volume. We also expanded the explanation of partner categories, including what is meant by “ascending” and “descending” neurons in a brain-only dataset. Additional details of these revisions are provided in the Recommendations for the authors section.

      (3) Visualization choices also sometimes obscure key conclusions: network graphs can be visually appealing but do not clearly convey somatotopy or BMN-type differences; heatmaps or region-level matrices would make the parallel, block-like organization of the circuit more evident.

      We incorporated connectivity matrices (cosine-similarity heatmaps) into the main figures to more clearly illustrate the somatotopic and parallel organization of BMN connectivity, complementing the network graph visualizations (new Figure 9). These matrices make the block-like structure of BMN partner relationships more apparent and help highlight differences among BMN types; additional details are provided in the Recommendations for the authors section.
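
      As an illustration of what a cosine-similarity matrix of this kind measures, the sketch below compares hypothetical BMN types by the pattern of their synapse counts onto a shared set of partners. The type names, the counts, and the helper functions are assumptions made for the example; they are not the data or code behind Figure 9.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length synapse-count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a vector with no synapses is treated as sharing nothing
    return dot / (norm_u * norm_v)


def similarity_matrix(counts):
    """Pairwise cosine similarity between every pair of rows in `counts`."""
    names = sorted(counts)
    return {a: {b: cosine_similarity(counts[a], counts[b]) for b in names}
            for a in names}


# Hypothetical counts: each row is a BMN type, each column a postsynaptic partner.
toy_counts = {
    "type_A": [20, 15, 0, 0],   # shares partners with type_B
    "type_B": [18, 12, 2, 0],
    "type_C": [0, 0, 25, 30],   # targets a disjoint set of partners
}

sim = similarity_matrix(toy_counts)
```

      In this toy case, type_A and type_B score close to 1 while type_A and type_C score 0, which is the block-like pattern a heatmap makes visible. Cosine similarity depends on the pattern of partners rather than on total synapse number, so types with very different output strength can still appear similar.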

      (4) The data might also speak to roles beyond grooming (e.g., mechanosensory modulation of posture or feeding), and a brief acknowledgement of this would broaden the impact.

      We added text acknowledging that BMNs contribute to additional behaviors beyond grooming, such as feeding and other mechanosensory-guided actions. These roles are supported by prior studies of bristle function and are also consistent with the diverse downstream circuits revealed in the connectome. This clarification broadens the interpretation of the dataset while maintaining the primary focus of the study on grooming-related circuitry.

      (5) The restriction to one hemisphere should be explicitly acknowledged as a limitation when framing this as a 'comprehensive' connectome.

      We thank the Reviewer for this suggestion. We now explicitly acknowledge this limitation in both the Results and Discussion.

      In the Results section entitled “The BMN connectome” we added a sentence at the end of the paragraph that mentions the limitations. This sentence reads: “In addition, because our analysis was restricted to BMNs entering the left hemisphere, the complete right-side BMN connectome is not included, limiting assessment of bilateral symmetry, inter-hemispheric coordination, and variability across sides.”

      The last paragraph of the first Discussion section describes limitations to our ‘comprehensive’ connectome. The text in this paragraph pertaining to the left/right variability reads: “Second, the analysis focuses only on BMNs from the left hemisphere. Although contralateral neurons synapsing with left-side BMNs are included, the absence of the right-side BMN connectome limits assessment of bilateral symmetry, interhemispheric coordination, and side-to-side variability.”

      Overall, the authors achieve their main goal: they convincingly show that BMNs connect into parallel, somatotopically organized pathways, with LB23 providing a key lineage-based link from sensory input to grooming output. The dataset is carefully analyzed, and while the presentation could be streamlined, the connectome will be a valuable resource for researchers studying sensory processing, motor control, and the logic of circuit organization.

      Recommendations for the authors:

      Reviewing Editor Comments:

      We enjoyed this work and are enthusiastic about its contribution: the resource is valuable, and the anatomical evidence is solid. Most of our suggestions concern clarity and visualization, detailed below.

      In addition, the editors and reviewers felt one focused analysis would materially strengthen the paper: please use the BMN→second-order synapse weights to produce a similarity-based, one-dimensional order of BMN types and test its agreement with the known grooming sequence (e.g., via a rank correlation). A positive result would support sufficiency of the mapped wiring for the sequence; if not, the claims can be framed as "consistent with" rather than "sufficient for."
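
      For concreteness, the rank-correlation step of the suggested test can be sketched as follows. This is a minimal illustration under stated assumptions: the two orderings are hypothetical placeholders, ties are not handled, and the step that would derive a one-dimensional order from synapse weights is not shown. The function `spearman_rho` is written for this example and is not part of the study.

```python
def spearman_rho(order_a, order_b):
    """Spearman rank correlation between two orderings of the same items (no ties)."""
    n = len(order_a)
    if n < 2:
        raise ValueError("need at least two items")
    if len(set(order_a)) != n or sorted(order_a) != sorted(order_b):
        raise ValueError("orderings must contain the same unique items")
    rank_b = {item: i for i, item in enumerate(order_b)}
    # Sum of squared rank differences between the two orderings.
    d_squared = sum((i - rank_b[item]) ** 2 for i, item in enumerate(order_a))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Hypothetical orderings of four head regions (placeholders, not results).
connectivity_order = ["region_1", "region_2", "region_3", "region_4"]
behavior_order = ["region_1", "region_3", "region_2", "region_4"]

rho = spearman_rho(connectivity_order, behavior_order)  # 0.8 for this toy example
```

      With only a handful of head regions, a coefficient like this rests on very few ranks, so even a high value would be weak evidence on its own.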

      We appreciate the Reviewers’ interest in how BMN connectivity relates to the grooming sequence, and agree that understanding how mechanosensory circuits contribute to hierarchical action selection is an important direction. In this study, however, our goal was not to test whether connectivity alone is sufficient to reconstruct the full grooming sequence. Rather, we focused on defining the parallel circuit architecture underlying individual grooming movements and on identifying anatomical features—most notably extensive presynaptic inhibition—that are consistent with previously proposed models of hierarchical suppression.

      We recognize that references in the Introduction to prior work on the grooming sequence may have led some readers to expect a direct sequence-prediction analysis. To address this, we revised the Introduction and Results to clarify scope and adjusted language to avoid implying that we aimed to derive the grooming order from connectivity. Consistent with this framing, the Abstract mentions the sequence only in the context of presynaptic inhibition, which provides anatomical support for existing models of hierarchical suppression. We do not draw conclusions about the ordering of grooming movements from the connectome itself.

      The Reviewer-suggested analysis—using BMN-to-partner synaptic weights to derive a linear ordering of BMN types—is conceptually reasonable, but its interpretability is limited at present. First, synaptic weight alone does not align with known features of the grooming sequence: BM-Taste neurons contribute the majority of BMN synaptic output, yet proboscis grooming is not the first head movement, whereas BM-InOm neurons contribute less than 9% of output despite eye grooming occurring first. Second, BMNs likely project to multiple pathways supporting distinct behaviors, such as feeding and escape, complicating any attempt to infer a single grooming hierarchy from aggregate connectivity. Third, the head grooming sequence itself has not been resolved at the granularity required for such an analysis, particularly among the antennae, proboscis, and other head bristle regions. Accordingly, we have deliberately refrained from making claims that connectivity is sufficient to generate the grooming order.

      Given that we do not claim sufficiency of the connectome for producing the grooming sequence, we respectfully request that this point be removed from the public eLife Assessment, as its current wording implies an unmet expectation outside the intended scope of the study and could mislead readers about the manuscript’s primary contributions. We appreciate the opportunity to clarify our framing and to ensure that the goals and outcomes of the work are accurately represented.

      Revisions.

      (1) Gave the last paragraph of the Introduction more structure to clearly state the main findings of the study in the context of what we learned about the circuit architecture proposed by the parallel model of hierarchical suppression.

      New paragraph: “Here, we define the synaptic connectivity of head BMNs by mapping nearly all of their pre- and postsynaptic partners—including other BMNs, ascending and descending neurons, interneurons, and motor neurons—within the FAFB dataset. Consistent with a parallel model, we find that both presynaptic and postsynaptic partners are somatotopically organized, preserving the spatial layout of the bristle map and revealing a set of parallel mechanosensory pathways that correspond to distinct head regions. Within the postsynaptic population, we identify the developmentally-related cholinergic hemilineage 23b (LB23), whose members exhibit region-specific BMN connectivity and include neurons previously shown to elicit aimed head grooming movements when activated. This demonstrates how LB23 neurons participate in parallel postsynaptic pathways that may drive discrete components of head grooming. On the input side, BMNs receive substantial presynaptic inhibition from predominantly GABAergic partners, providing strong feedback and feedforward control over mechanosensory signaling. This inhibitory architecture is consistent with hierarchical-suppression models in which inhibition regulates sensory gain and prioritizes competing actions in the grooming sequence. Together, this mechanosensory connectome reveals core organizational principles—parallel somatotopic architecture, region-specific excitatory pathways, and strong inhibitory regulation—that are thought to constitute foundational circuit motifs supporting head grooming.”

      (2) In the Results section entitled “BMN synapses show large quantitative variation across types”, we added text to the third paragraph that makes it clear that raw synapse numbers alone do not predict the sequence, if one just compares the first movement (eye grooming) and a later movement in the sequence (proboscis grooming).

      That text reads: “Notably, if grooming order were driven simply by relative sensory drive—i.e., by BMN types with the strongest synaptic output eliciting cleaning of their corresponding locations first—then synapse number should track the grooming sequence. Instead, differences in synapse number do not align with the order of the grooming sequence: BM-Taste neurons account for the majority of BMN output, yet proboscis grooming is not the first head grooming movement performed, whereas BM-InOm neurons contribute only a small fraction of output despite eye grooming occurring first (Figure 1E, Figure 2A,B). This indicates that global synapse number alone is not a reliable predictor of the grooming sequence.”

      (3) In the Results section entitled “BMN postsynaptic partners are excitatory and inhibitory”, we added text to two sentences to better link the results to what we are testing with respect to the parallel model of hierarchical suppression.

      Modified sentence 1: “This excitation is hypothesized in the parallel model to help form BMN feedforward circuits that elicit aimed grooming of specific body locations, while feedforward inhibition could mediate suppression of competing grooming movements (Figure 1 – figure supplement 1A, B).”

      Modified sentence 2: “Taken together, the BMN postsynaptic partners include a diverse set of neurons that mediate both feedforward excitation and inhibition and feedback inhibition, features predicted by the parallel model.”

      (4) In the Results section entitled “BMNs and LB23 neurons form somatotopic pathways that elicit aimed grooming”, we added text to the first sentence that better ties the section to the overall goals of the manuscript.

      That text now reads: “In accordance with the parallel model of grooming, we hypothesize that BMNs connect with somatotopically organized excitatory parallel pathways eliciting aimed grooming of specific head locations (Figure 1 – figure supplement 1A, C).”

      Reviewer #1 (Recommendations for the authors):

      (1) The connectivity matrix (like that in Lesser et al., 2024, Nature, and also in Figure 9, Figure Supplement 1 of this paper) is an easier-to-digest representation of the various connections shown in Figure 2.

      We agree that connectivity matrices provide a clearer and more accessible representation of these data. Based on the context of this and other comments, we understand the Reviewer to be referring to Figure 9 rather than Figure 2. In response, we have moved the cosine-similarity connectivity matrices previously shown in Figure 9 – figure supplement 1 into the main manuscript, where they now appear as Figure 9.

      These matrices depict similarity among BMN postsynaptic partners. At present, we are unable to generate equivalent matrices for presynaptic partners due to recent personnel constraints in the lab. For this reason, we have retained the original network-graph representation (now Figure 10) to display the full pre- and postsynaptic connectome structure.

      We hope this compromise addresses the Reviewer’s request while clearly presenting the available analyses.

      (2) Again, "Cosine based clustering is essential to demonstrate the somatotopic organization": the data in Figure 9 - Figure Supplement 1 demonstrates this better than the main Figure 9. This supplementary figure would be a great addition to the main manuscript.

      Please see the preceding response for details on the changes that we made to address this reviewer comment.

      (3) Figure 9 - Figure Supplement 1A: Can the authors explain why the InOm occur in two clusters (red in top and bottom)? Do InOm neurons show two different kinds of connectivity patterns?

      This is a great question! We had written a possible explanation for this in the Discussion section entitled “A synaptic resolution connectome of a head somatotopic map”.

      “One notable exception to this pattern is the BM-InOm population, which occupies a central position in network diagrams and exhibits broad connectivity similarity with BMNs from across the head (Figure 9A, Figure 10A-E). This likely reflects the large surface area of the compound eyes, which span dorsal, ventral, and posterior regions and neighbor multiple bristle populations. Consistent with previous work showing morphological diversity among BM-InOm neurons (Eichler et al., 2024), our output connectivity analysis suggests the presence of multiple BM-InOm subtypes defined by distinct partner profiles (Figure 9A). Future work will be needed to determine how this heterogeneity relates to spatial organization within the eye.”

      Reviewer #2 (Recommendations for the authors):

      All further comments for the authors are aimed at a better understanding of the text and for clarity. The manuscript needs revision.

      (1) Ventral brain:

      Please specify this term. Is it the SEG, or the gnathal ganglion? Throughout the paper, 'ventral brain' and 'brain' are the only anatomical terms you use. Are all pre-/post- partners of BMNs located in this region? I understand that you provide a statistical analysis on a network level, here, but as far as I know, the neuropil regions in Drosophila are reported in more detail on the macroscopic level (see, e.g., Ito et al.).

      Based on our understanding of the Ito et al. reference, SEG was “retired” in that manuscript in favor of gnathal ganglia. We considered using the term subesophageal zone (SEZ) in the manuscript, but ultimately chose not to adopt it. In the Drosophila brain nomenclature (Ito et al., 2014), the SEZ is defined as a region below the esophagus that encompasses multiple neuropils, such as the gnathal ganglia (GNG) and saddle (SAD), rather than a single anatomically discrete structure.

      In our dataset, the GNG are the ventral-most neuropil containing the BMN projections and the highest density of BMN-related synapses, and we therefore refer to this structure explicitly where appropriate. However, BMN pre- and postsynaptic partners are not confined to the GNG or to the SEZ as a whole; some partner neurites extend dorsally into additional neuropils. As a result, the term SEZ does not accurately capture the full spatial extent of the BMN connectome analyzed here.

      For clarity and consistency across analyses that span multiple adjacent neuropils, we therefore use the broader functional descriptor “ventral brain”, while explicitly identifying the gnathal ganglia and other neuropils when discussing neuropil-level synapse distributions. We believe this approach most accurately reflects both the anatomical organization of the circuit and the scope of our analysis.

      Given this Reviewer’s comment, we anticipate that not mentioning the SEZ in this manuscript might result in similar confusion among readers of our manuscript. Therefore, we now mention the SEZ and the supraesophageal zone (SPZ) at the end of the Results section entitled “Synapses of BMN partners are mostly concentrated in the ventral brain”. We also added the SEZ and the SPZ to the new last summary figure (Figure 15) to help clarify the locations of the BMNs and their second order connectome.

      That text reads: “Thus, while most neuropils containing synapses of second-order BMN partners are located below the esophagus (in the subesophageal zone, SEZ), we found more limited involvement of neuropils in the supraesophageal zone (SPZ; above the esophagus), suggesting relatively limited direct top-down control.”

      (2) Please provide greater clarity in your use of the terms synapse-presynapse-pre- and postsynaptic partners:

      In insects, synapses are polyads. It is therefore essential to distinguish whether by presynaptic (pre) you mean 1. the number of T-bars (presynaptic sites) or 2. the number of (outgoing) synaptic contacts made by a single presynaptic T-bar site. For example, a synapse configured as a tetrad (a polyad) consists of one presynaptic T-bar opposed to four postsynaptic profiles and can be counted either as one synapse (one presynaptic site, one T-bar, in CATMAID: a presynaptic connector) OR as four (outgoing) synaptic connections since the single T-bar connects to four different postsynaptic profiles. This distinction is crucial for quantifying synaptic networks in insects. Thus, the "number of synapses" may refer to 1. The number of presynaptic sites = number of T-bars = number of polyads formed by a particular neuron. 2. the number of actually outgoing synaptic contacts, a number that also reflects the degree of polyadicity. 3. number of postsynaptic sites (that is easy).

      This distinction (regarding the counts of presynapses) was reported in previous connectome studies (e.g., Horne, 2018; Gruber, 2025; Schlegel, 2023). Schlegel notes: 'Insect synapses are polyadic, i.e., each presynaptic site can be associated with multiple postsynaptic sites. In contrast to the Janelia hemibrain dataset, the synapse predictions used in FlyWire do not have a concept of a unitary presynaptic site associated with a T-bar. Therefore, presynapse counts used in this paper do not represent the number of presynaptic sites but rather the number of outgoing connections.' End of citation from Schlegel.

      We thank the Reviewer for highlighting this important distinction. We now clarify in the Materials and methods that synapse counts are based on Codex/FlyWire annotations, which report individual pre- and postsynaptic contacts rather than unitary presynaptic sites (T-bars), consistent with prior FlyWire-based connectome studies (e.g., Schlegel et al.). We also added a brief clarification in the Results indicating that pre- and postsynaptic numbers refer to incoming and outgoing contacts.

      We added a sentence to the first section of the Materials and methods entitled “Connectome data and neuron meshes”. This text reads: “Synapse counts throughout this study are based on FlyWire/Codex synapse annotations and represent the number of individual pre- to postsynaptic contacts (incoming or outgoing connections), rather than the number of presynaptic active sites (T-bars); thus, presynaptic counts reflect polyadic connectivity as described previously (Schlegel et al., 2023).”
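      To make the counting distinction concrete, the sketch below tallies the same toy connectivity in both ways. The record layout and the explicit presynaptic-site identifiers are assumptions made purely for illustration; as noted above, FlyWire synapse predictions have no concept of a unitary T-bar, so only the contact-level counts correspond to what is reported in this study.

```python
# Hypothetical records: (presynaptic_site_id, presynaptic_neuron, postsynaptic_neuron).
# "tbar_1" is a tetrad: one presynaptic site apposed to four postsynaptic profiles.
contacts = [
    ("tbar_1", "BMN_a", "IN_1"),
    ("tbar_1", "BMN_a", "IN_2"),
    ("tbar_1", "BMN_a", "DN_1"),
    ("tbar_1", "BMN_a", "DN_2"),
    ("tbar_2", "BMN_a", "IN_1"),
    ("tbar_3", "IN_1", "BMN_a"),
]

def count_synapses(contacts, neuron):
    """Count one neuron's synapses by presynaptic site and by individual contact."""
    sites = {site for site, pre, _ in contacts if pre == neuron}
    outgoing = sum(1 for _, pre, _ in contacts if pre == neuron)
    incoming = sum(1 for _, _, post in contacts if post == neuron)
    return {
        "presynaptic_sites": len(sites),   # T-bar style count (one per polyad)
        "outgoing_contacts": outgoing,     # contact-level count (reflects polyadicity)
        "incoming_contacts": incoming,     # postsynaptic sites on this neuron
    }

counts = count_synapses(contacts, "BMN_a")
```

      Here the toy neuron has two presynaptic sites but five outgoing contacts, so the two conventions give different "presynaptic" numbers for identical anatomy.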

      (3) In your study, a potential misunderstanding of this distinction arises when comparing statements on line 168 versus line 184:

      On line 168, you state: '... each BMN type having .... more postsynaptic than presynaptic sites'. However, on line 184 you state: 'There were significantly more postsynaptic than presynaptic partners, in agreement with the BMNs containing more presynaptic than postsynaptic structures'. These are contradictory: the statement on line 168 seems to refer to the number of presynaptic T-bars, while on line 184 you refer to the number of actually outgoing connections (which more accurately reflects the degree of polyadicity). Since BMNs are sensory afferents, they are indeed expected to have more outgoing synapses into the central brain.

      We thank the Reviewer for identifying this mistake. We have revised the sentence at former line 168 to now read: “In addition to differing in total synapse number, BMN types vary in their pre- versus postsynaptic composition: all BMNs contain both (Eichler et al., 2024), with presynaptic sites outnumbering postsynaptic sites by ~2× to ~9× across types (mean ≈5:1 output-to-input ratio, Figure 2 – figure supplement 1A, B, Supplementary file 2, Supplementary file 3).”

      (4) Identification of bristle sensory afferents in the brain:

      This is explained in more detail in the Eichler paper, but not here. I do not understand how you identified these neurons in the FAFB dataset. The number and distribution of bristles in the individual fly of the FAFB EM dataset are not known, and because there is variability in the number of bristles in individual flies, the true number of bristle neurons for synaptic analysis can only be estimated. The correlative approach necessary to find the bristle sensory neurons in the FAFB set is still unclear to me. See also my comments on Figure 1.

      We thank the Reviewer for raising this point. We agree that our original draft did not clearly explain the correlative approach used to identify head BMNs in the FAFB dataset, and we have revised the manuscript to make this workflow explicit.

      In our prior work (Eichler et al., 2024), we quantified the number of bristles in each head bristle population and assessed the extent to which populations are invariant versus variable across individuals. This established an expected range for BMN counts by bristle population and clarified the level of variability that can be expected biologically.

      We then identified BMN types corresponding to specific bristle populations using different techniques, such as dye fills and light microscopy, which allowed us to define the characteristic projection morphologies and CNS entry routes associated with each population. These light-level anatomical signatures provided the basis for locating the corresponding axons in the FAFB EM volume and reconstructing the same neuron classes in EM. Importantly, because bristles themselves are not present in the EM volume, this approach supports type-level assignment (bristle population/BMN class) rather than single-bristle resolution, and we now state this explicitly to avoid overinterpretation.

      To ensure this is clear to readers who have not read Eichler et al., we have added explanatory text in the Results and expanded the Figure 1 legend describing: (i) how BMN types were identified and matched, (ii) what can and cannot be resolved given natural bristle-number variability, and (iii) how this impacts interpretation of “completeness” at the level of BMN types rather than individual bristles.

      In paragraph 1 of the first Results section, entitled “BMN synapses are somatotopically distributed in the ventral brain”, we added text that briefly describes the previous linkage of the head BMNs to the FAFB dataset. That text reads: “In prior work (Eichler et al., 2024), we showed that head bristle populations are innervated by specific BMN types whose axons project to distinct, spatially localized regions (projection zones) in the ventral brain (Figure 1C,D, left, Figure 1 – figure supplement 2A-E). This was determined using dye fills and light-microscopy-based tracing to identify BMN types innervating defined head bristle populations and to establish their characteristic brain projection morphologies. Bristle population counts and their variability across individuals provided expectations for BMN number per type. This quantitative constraint, combined with the highly stereotyped projection morphologies, provided a correlative anatomical framework to locate and reconstruct nearly all BMNs in the FAFB serial-section EM volume and map their projections into the CNS. Because FAFB does not include the head cuticular bristles, individual BMNs could not be linked to single bristles. Therefore, these assignments are necessarily correlative and provide type-level (population) rather than single-bristle resolution. Nevertheless, this level of resolution was sufficient to define somatotopically organized projection zones.”

      (5) Results:

      (a) Line 102: explain hemilineage 23 B

      We added text in the manuscript to better define hemilineages.

      In the Abstract, we added to a sentence that highlights that the LB23 neurons are developmentally related. That sentence now reads: “We identified an excitatory cholinergic hemilineage (hemilineage 23b), a developmentally related group of neurons that elicits aimed head grooming and exhibits differential connectivity with BMNs from distinct head locations, revealing a lineage-based somatotopically organized parallel circuit architecture.”

      In the Results section entitled “The entire cholinergic hemilineage 23b (LB23) is postsynaptic to BMNs”, we added a sentence that defines hemilineage at its first mention in the Results section. We also made slight modifications to the preceding and following sentences. That text reads: “To identify neurons crucial for establishing the BMN-postsynaptic parallel pathways that elicit head grooming movements, we focused on secondary hemilineages. In the Drosophila CNS, a hemilineage refers to the cohort of neurons derived from a single stem cell-like neuroblast that share a common developmental origin, stereotyped morphology, and are thought to have related functional roles within a circuit (Harris et al., 2015; Wreden et al., 2017). This focus was motivated by earlier findings that neurons whose activation elicited head grooming had morphologies consistent with specific hemilineages (Hampel et al., 2015; Seeds et al., 2014).”

      (b) Line 151: - line 171: it is not clear to me what a projection zone is.

      We thank the Reviewer for raising this point. We agree that the term “projection zone” benefits from a brief clarification. We have made minor edits at two locations to explicitly state that projection zones refer to spatially localized regions of BMN axonal arborization and synaptic distribution corresponding to specific head locations.

      Changes made in the manuscript:

      A sentence that first introduces the term in the fourth paragraph of the Introduction now reads: “Indeed, the BMN axon projections in the central nervous system (CNS) show a somatotopic arrangement, where distinct projection zones—spatially localized regions of axonal arborization and synaptic output—correspond to specific head and body locations (Eichler et al., 2024; Johnson and Murphey, 1985; Murphey et al., 1989; Newland, 1991; Newland et al., 2000; Tsubouchi et al., 2017).”

      In a sentence in the first paragraph of the first Results section, we added a brief clarifying definition of “projection zones” at their first mention in the Results. That sentence reads: In prior work (Eichler et al., 2024), we showed that head bristle populations are innervated by specific BMN types whose axons project to distinct, spatially localized regions (projection zones) in the ventral brain (Figure 1C,D, left, Figure 1 – figure supplement 2A-E).

      (c) Input-output versus presynapse-postsynapse?

      A revised sentence in the last sentence of the Results section makes this distinction clear: In addition to differing in total synapse number, BMN types vary in their pre- versus postsynaptic composition: all BMNs contain both (Eichler et al., 2024), with presynaptic sites outnumbering postsynaptic sites by ~2× to ~9× across types (mean ≈5:1 output-to-input ratio, Figure 2 – figure supplement 1A,B, Supplementary file 2, Supplementary file 3).

      (6) Figures:

      For clarity, it would be helpful if you indicated by the arrow the name of the sensory location (antenna, eye, etc.).

      We appreciate this suggestion. Major sensory locations corresponding to different head bristle populations are indicated in Figure 1 – figure supplement 1C. We explored adding these labels directly to Figure 1A, but found that doing so made the panel overly crowded and less clear. To improve visibility while keeping the main figure uncluttered, we now explicitly direct readers to this figure supplement in the Introduction.

      Specifically, we added a reference to Figure 1 – figure supplement 1C in the following sentence in the Introduction: Dust-induced head grooming is performed by the forelegs that start with the eyes and progress to other locations such as the proboscis and antennae (major head locations shown in Figure 1 – figure supplement 1C) (Seeds et al., 2014).

      (a) Figure 1:

      A: the presence of bristle types on the head. Are the JO afferents you mention in the text reported here?

      Figure 1 does not include the JONs, which were described in detail in our previous study (Hampel et al., 2020).

      The JONs are mentioned in the legend of Figure 1 – figure supplement 1. We have added text to this legend to indicate that the JONs are not the subject of this study. This text reads: “(C) Mechanosensory neurons from different head locations project to distinct, somatotopically organized zones in the ventral brain and elicit aimed grooming of those locations, including the antennae (via JONs [Johnston’s organ neurons; not analyzed in this study] and BMNs), eyes (BMNs), and proboscis (BMNs).”

      Are the reconstructions shown 1 B-D also from the Eichler paper?

      We regret that this was not explicitly stated in the figure legend, and have revised the legend to distinguish between what was previously published and what is new to this study.

      In the Figure 1 legend, we revised the following sentence: (C, D) Reconstructed BMN projections in the ventral brain (left, previously described in (Eichler et al., 2024)) and their corresponding pre- and postsynaptic sites (right, this study), colored by type according to the bristles that they innervate.

      To make this clearer in the main text, we have rewritten the first sentence in the first paragraph of the Results: In prior work (Eichler et al., 2024), we showed that head bristle populations are innervated by specific BMN types whose axons project to distinct, spatially localized regions (projection zones) in the ventral brain (Figure 1C,D, left, Figure 1 – figure supplement 2A-E).

      Are the dots symbolic, or do they represent the number of bristles? The number of bristles cannot be identified, and thus stems from the FAFB dataset.

      The dots are symbolic and do not represent the number of bristles in the FAFB dataset. As noted in response to a related reviewer comment above, the numbers and variability of head bristles were quantified in our prior work (Eichler et al., 2024). We also used dye fills and light-microscopy approaches, which provided the framework for linking BMN types to bristle populations. We have clarified this point in the revised manuscript, as described in the response above.

      Synapse number of bristle afferents: number of all pre-and postsynaptic contacts?

      We have addressed this point above.

      (b) Figure 2:

      Again, the term synapses refers to all pre- and postsynaptic contacts?

      The Figure 2 legend indicates that synapse numbers include both input and output synapses. Additionally, now the first reference to Figure 2 indicates that numbers refer to both input and output synapses.

      (c) Figure 2:

      Supplement presynaptic/postsynaptic means pre- and post partner?

      Presynaptic: number of BMNs that were connected with at least 5 synapses to any given presynaptic partner (n), the numbers of synaptic inputs to BMNs (inputs), and the number of presynaptic partners (partners). Postsynaptic: number of BMNs that were connected with at least 5 synapses to any given postsynaptic partner, the numbers of synaptic outputs to postsynaptic partners, and the number of postsynaptic partners.
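      The five-synapse criterion in these definitions can be sketched as a simple filter over neuron pairs. This is a minimal illustration with made-up neuron names and counts; the input format (one entry per individual synaptic contact) is an assumption, not the layout of the FlyWire/Codex tables.

```python
from collections import Counter

MIN_SYNAPSES = 5  # connection threshold stated in the text

def thresholded_connections(contact_pairs, min_synapses=MIN_SYNAPSES):
    """Return {(pre, post): count} for pairs joined by at least `min_synapses` contacts."""
    pair_counts = Counter(contact_pairs)
    return {pair: n for pair, n in pair_counts.items() if n >= min_synapses}

# Hypothetical contacts, one (presynaptic, postsynaptic) tuple per synaptic contact.
toy_contacts = (
    [("BMN_a", "IN_1")] * 6
    + [("BMN_a", "IN_2")] * 4   # below threshold, dropped
    + [("BMN_b", "IN_1")] * 5
)

kept = thresholded_connections(toy_contacts)
connected_bmns = {pre for pre, _ in kept}           # "n" in the figure supplement
postsynaptic_partners = {post for _, post in kept}  # "partners"
total_outputs = sum(kept.values())                  # "outputs"
```

      Note that a BMN whose connections all fall below the threshold would drop out of the "n" tally entirely, which is why the number of BMNs meeting the criterion can be smaller than the number reconstructed.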

      (d) Figure 3:

      Explain downstream-upstream

      Downstream refers to postsynaptic while upstream refers to presynaptic partners or pathways.

      Comparing the right side of the Sankey diagram with your diagram in B, just by judging, I see more partners of descending (post) than interneurons (post) in A. However, in B, there are clearly more postsynaptic interneurons than descending posts? There are no numbers in Figure 3A.

      This is a great point! Figure 3A (the Sankey diagram) summarizes the fraction of BMN synaptic output distributed across partner classes, normalized within each BMN type. In this representation, descending neurons occupy a larger fraction because, across BMN types, they collectively receive a higher proportion of BMN output synapses.

      In contrast, Figure 3B (the sunburst plot) summarizes the number of distinct postsynaptic partner neurons in each category. Here, interneurons are more numerous than descending neurons, even though individual interneurons tend to receive fewer BMN synapses on average.

      Thus, the two plots are consistent: descending neurons are fewer in number but receive more synapses per neuron, whereas interneurons are more numerous but receive fewer synapses per neuron on average. When postsynaptic synapse counts are summed (as in the bottom plots), the totals for descending neurons and interneurons can therefore appear similar, despite their different representations in the Sankey diagram.

      We have added text in the Results section entitled “BMN synaptic partners in the CNS: ascending, descending, and interneurons”. Text was added here because it also nicely responds to another Reviewer comment below for more description of the postsynaptic partners. That added text reads: “Interneurons are more numerous as distinct partner neurons, whereas descending neurons receive a larger fraction of BMN output synapses across BMN types (Figure 3A,B). Thus, descending neurons are fewer in number but tend to receive more BMN synapses per neuron on average, while interneurons are more numerous but often receive fewer synapses per neuron.”
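      The difference between the two summaries can be shown with a small worked example. The numbers below are invented for illustration and are not taken from the study, and the sketch pools all BMN types rather than normalizing within each type as the Sankey diagram does.

```python
from collections import defaultdict

# Hypothetical partners: name -> (morphological class, BMN output synapses received).
toy_partners = {
    "DN_1": ("descending", 60),
    "DN_2": ("descending", 40),
    "IN_1": ("interneuron", 10),
    "IN_2": ("interneuron", 15),
    "IN_3": ("interneuron", 5),
    "IN_4": ("interneuron", 10),
    "IN_5": ("interneuron", 10),
}

def summarize_by_class(partners):
    """Per class: neuron count, share of all synapses, and mean synapses per neuron."""
    n_neurons = defaultdict(int)
    n_synapses = defaultdict(int)
    for cls, synapses in partners.values():
        n_neurons[cls] += 1
        n_synapses[cls] += synapses
    total = sum(n_synapses.values())
    return {
        cls: {
            "n_partners": n_neurons[cls],                           # sunburst-style count
            "synapse_fraction": n_synapses[cls] / total,            # Sankey-style share
            "synapses_per_partner": n_synapses[cls] / n_neurons[cls],
        }
        for cls in n_neurons
    }

summary = summarize_by_class(toy_partners)
```

      In this toy case, interneurons outnumber descending neurons five to two, yet descending neurons receive two thirds of the synapses, so a partner-count plot and a synapse-fraction plot rank the two classes in opposite order without contradicting each other.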

      (e) Figure 10: I cannot see colored circles. I found Figure 10 very hard to understand. Is this a visualization created in CATMAID? As I mentioned before, a graphical summary highlighting the information flow and architecture of the circuits analyzed in this study would be useful. In such a diagram, you could combine the findings of your study, the open question, and the undeciphered pathways. In short, a schematic of the current knowledge of the potentially parallel and recurrent architecture of the BMN circuitry.

      Figure 10 (now Figure 11) is intended to specifically examine neurons that are both pre- and postsynaptic to BMNs, rather than to summarize the full connectome. The goal of this figure is to highlight two features of pre/post neurons: their somatotopic connectivity with BMN types and the presence of bilaterally symmetric neuron pairs that connect to common BMN populations.

      This visualization was generated from connectome-derived connectivity data and not from CATMAID, although it uses neuron reconstructions and synapse annotations from the FAFB dataset. The colored nodes represent BMN types and are now consistently referred to as “dots” rather than “circles” to better match their appearance. We have simplified the figure legend to clarify these points.

      In response to this and related comments, we also added a new graphical summary figure (Figure 15) at the end of the manuscript that schematically summarizes the information flow and parallel, recurrent architecture of the BMN circuitry at a higher level.

      (7) Discussion:

      I found the first part of your discussion hard to read; the second part is better. You can condense the discussion by mentioning the results/hypothesis of previous work once, and avoiding repetitions, such as the uniqueness of the BMN connectome/FAFB dataset.

      In response to this comment, we condensed the opening portion of the Discussion by reducing repetition of background and prior findings, particularly references to earlier BMN work and the uniqueness of the FAFB dataset. We streamlined overlapping sections, mentioned prior hypotheses and results only once, and focused the revised text more directly on the new contributions of this study—namely, the synaptic-resolution organization, somatotopic connectivity, and circuit principles revealed by the BMN connectome.

      There are several cases of vague sentences, e.g.: a) Line 827: 'Head BMNs project from bristles to somatotopically organized zones in the brain (? ventral brain ?), with those innervating neighboring populations (? of bristles ?) occupying overlapping zones (Figure 1A-D)'.

      We made this suggested change: Head BMNs project from bristles to somatotopically organized zones in the ventral brain, with those innervating neighboring bristle populations occupying overlapping zones (Figure 1A-D).

      A remark: maybe you should indicate in Figure 1D the overlapping and segregated zones. The resolution is very low in these images.

      We thank the Reviewer for this comment and agree that overlap versus segregation of projection zones was not sufficiently guided in the original presentation. Rather than adding arrows to Figure 1C,D, which we felt would reduce clarity, we now explicitly describe how overlap and segregation can be identified based on color mixing of BMN synapses in the text and figure legend. In addition, we highlight these features more clearly in Figure 1 – figure supplement 3, which provides higher-resolution, multi-view visualizations of BMN synapses where overlap and non-overlap are most evident.

      Results:

      Segregation between projection zones is apparent where synapses of distinct BMN types occupy non-overlapping regions with little or no color mixing, whereas overlap between projection zones is visible as spatial intermixing of differently colored synapses from neighboring BMN types (Figure 1C, D, right, Figure 1 – figure supplement 3A-E).

      Figure 1 legend:

      Overlapping projection zones are evident where synapses of different BMN types spatially intermingle, whereas segregated zones show little or no color mixing.

      Figure 1 – figure supplement 3 legend:

      These views highlight both overlapping projection zones, visible as intermingled synapses of different colors from neighboring BMN types, and segregated zones, where synapses from distinct BMN types remain spatially separated with minimal color mixing.

      (b) Line 860: What is: 'location groomed'?

      We added a clarification to this sentence: Thus, the location groomed (i.e. antennae) corresponds to the location of the majority of BMN inputs.

      (c) Line 944: 'The sensory to motor resolution' What do you mean, here?

      We have revised this sentence to “The spatial resolution of the sensory-to-motor transformation in this parallel circuit architecture remains to be tested.”

      (d) The term 'neighboring bristles' is unclear. Does it mean 'neighbors are members within the same bristle type (antennae)', or 'bristles of different types', e.g., antennae and eye bristles?

      We thank the Reviewer for raising this point. Throughout the manuscript, the term “neighboring bristles” is used primarily to refer to neighboring bristle populations (i.e., bristles from different anatomical groups that are spatially adjacent on the head). In some contexts, the term is also used more generally to describe spatial proximity, regardless of whether the bristles belong to the same or different populations. Importantly, in both cases, the usage reflects the same underlying observation: BMNs innervating bristles that are spatially closer—whether within or between populations—show greater similarity in their postsynaptic connectivity than BMNs innervating more distant bristles.

      (e) Avoid abbreviations, or briefly explain the term under discussion: line 725: BM-InOm?

      We thank the Reviewer for pointing out that the BMN nomenclature was not sufficiently clear. BMNs are named according to the bristle population they innervate (e.g., BM-Ant neurons innervate antennal bristles; BM-InOm neurons innervate interommatidial eye bristles), as defined in the Figure 1 legend. To improve clarity, we ensured that the first occurrences of these terms in the Results explicitly include the corresponding head location (e.g., “eye BM-InOm neurons”), and we added brief contextual reminders at later points where this abbreviation appears. These changes clarify the meaning of BM-InOm and related abbreviations without introducing additional terminology.

      Changes made:

      Figure 1 legend: clarified that BMNs are named according to the bristle population they innervate (e.g., BM-Taste neurons innervate Taste bristles).

      Results, early first section (second paragraph): added head-location qualifiers at first mention (e.g., “eye BM-InOm neurons,” “proboscis BM-Taste neurons”) in sentences such as: “35 BM-Taste neurons innervating Taste bristles on the proboscis…” and “405 eye BM-InOm neurons innervating the interommatidial bristles on the eyes…”.

      Later Results text where the abbreviation appears (including the sentence addressing the 5-synapse cutoff): added “eye” before BM-InOm for context (e.g., “although 555 eye BM-InOm neurons are present… only 405 meet the five-synapse threshold”).

      (f) LB23 hemilineage: what was that again?

      We added text in the manuscript to better define hemilineages. This is described above in response to another Reviewer suggestion.

      (g) Line 732: What are ascending neurons?

      We had already included a definition of ascending neurons in the second Results section entitled “The BMN connectome”. Since this was not clear to the Reviewers, we expanded on this section. There is now a new paragraph in this same section. This paragraph reads:

      “Partners were grouped into five morphological categories—interneurons, descending neurons, ascending neurons, BMNs, and motor neurons—following FlyWire annotations (Dorkenwald et al., 2024). Interneurons were defined as neurons whose soma and all neurites were confined to the brain. Descending neurons were defined as neurons whose somata are located in the CNS and whose neurites extend into the descending tracts toward the ventral nerve cord (VNC). Conversely, ascending neurons were identified as neurons whose neurites enter the brain through the cervical connective and whose somata lie outside the FAFB imaged volume, resulting in only their neurites being visible in the dataset.”

      (h) Line 896: What is lineage matching?

      We thank the Reviewer for pointing this out. We realized that this sentence did not add clarity and contributed little to the manuscript, so we removed the sentence that used “lineage matching” from the manuscript.

      (i) Line 926: The Previous work ... sentence makes no sense to me.

      The sentence was reworked and now reads: “The mechanosensory neurons hypothesized from the parallel model that elicit the Drosophila grooming sequence were identified in previous work (Eichler et al., 2024; Hampel et al., 2020a, 2017, 2015; Mueller et al., 2019; Seeds et al., 2014; Zhang et al., 2020).”

      (j) The FAFB dataset is indeed unique, but the fact that it is repeated several times in your discussion does not ensure understanding of the obviously complex circuit architecture potentially underlying behavior. Please, focus on your discussion strictly and condense your arguments to the specific contribution and outcome of the data in the current manuscript.

      In response to this comment, we condensed the opening portion of the Discussion by reducing repetition of background and prior findings, particularly references to earlier BMN work and the uniqueness of the FAFB dataset. We streamlined overlapping sections, mentioned prior hypotheses and results only once, and focused the revised text more directly on the new contributions of this study—namely, the synaptic-resolution organization, somatotopic connectivity, and circuit principles revealed by the BMN connectome.

      (k) In some parts of the discussion, it is not clear to me whether you refer to results of the current study or to previous studies (Hampel, Eichler), e.g., 'Our work has shown ...' on line 872, or '...we find ... LB23 neuron elicit antennal grooming....', or line 909: 'Our work reveals ......'

      Sentence at former line 872 was revised and now reads: “While our past and present work together reveal that a subpopulation of LB23 neurons elicits antennal grooming, we also find evidence that other LB23 neurons in the hemilineage elicit additional head grooming movements.”

      Sentence at former line 909 was revised and now reads: “Our previous work and the present study reveal that the antennal grooming circuit receives inputs from two different classes of antennal mechanosensory neurons, the BMNs and JONs.”

      Reviewer #3 (Recommendations for the authors):

      All my comments are mostly only for clarity.

      (1) It would help readers if the manuscript explicitly stated how a sensory neuron can be postsynaptic - i.e., that BMN axons receive inhibitory inputs in the CNS - since this may not be intuitive to a broader audience.

      We appreciate this comment and added the following text to the last paragraph of the first Results section: As expected for sensory afferents, BMNs provide synaptic output to downstream circuits; however, the presence of postsynaptic sites may be less intuitive, and reflects that BMNs can also receive synaptic input onto their central axons within the CNS.

      (2) Figure 1 is a helpful context, but since much of it is directly reused from Eichler et al., 2024, it would strengthen the presentation if you clarified what is new here (e.g., the synapse quantification) versus what is recap. In addition, for readers less familiar with EM connectomics, it would be valuable to spell out how bristle neurons are assigned to classes in the absence of bristles themselves in the volume - i.e., that classification rests on stereotyped nerve entry and projection zones, which allow type-level but not single-bristle resolution. Explicitly flagging these methodological boundaries up front would make it clearer what information comes from the current work, what derives from previous reconstructions, and what the limits of resolution are.

      We have addressed this recommendation above for a similar suggestion by Reviewer 2 (see above for details). In brief, we inserted an overview of the methodology used to identify BMN types in the FAFB dataset, and we now explicitly state the limitations of this correlative approach. We added a sentence in the first paragraph of the Results section that states, “Because FAFB does not include the head cuticular bristles, individual BMNs could not be linked to single bristles. Therefore, these assignments are necessarily correlative and provide type-level (population) rather than single-bristle resolution.” In addition, we revised the Figure 1 legend to more clearly distinguish panels and reconstructions that were previously reported in Eichler et al. (2024) from synapse quantification and analyses that are new to the present study.

      (3) BMNs from neighboring bristle populations converge onto shared partners, while distant BMNs remain segregated - while the overlap was clear, the segregation was not visually clear in the first figure.

      We thank the Reviewer for this suggestion. We have addressed this point in our response to a similar comment from Reviewer 2 (see above), where we clarified how overlap versus segregation can be identified in Figure 1 and strengthened the text and figure legends to guide readers to these features without adding clutter to the figure.

      (4) The identification of direct BMN → motor neuron synapses is intriguing, but since these inputs make up only a small fraction of motor neuron synapses, it would help if the authors explicitly cautioned readers that these are likely modulatory contributions rather than stand-alone reflex arcs. This would prevent over-interpretation of the sensory-motor link. Similarly with the BMN>BMN connections.

      We thank the Reviewer for this suggestion. We revised the Results section “BMN postsynaptic motor neurons” to more explicitly caution that the direct BMN → motor neuron connections are likely modulatory rather than stand-alone reflex arcs, consistent with their small contribution to total motor neuron input. The revised text reads: “However, BMN inputs accounted for only a small fraction of total synapses onto each motor neuron (≦6.28% of total inputs/BMN type, Figure 4 – figure supplement 1, Supplementary file 7), suggesting a modulatory contribution rather than direct sensory-driven motor activation.”

      (5) Since the FAFB dataset only includes the brain, it would be helpful to clarify what is meant by "ascending" and "descending" partners in this context - namely that ascending neurons are VNC-derived axons entering the brain, while descending neurons are brain-derived neurons projecting out toward the VNC. Explicitly stating this will prevent confusion, given that all BMNs themselves terminate in the SEZ.

      We had already included definitions in the second Results section entitled “The BMN connectome”. Since this was not clear to the Reviewers, we expanded on this section. There is now a new paragraph in this same section. This paragraph reads: Partners were grouped into five morphological categories—interneurons, descending neurons, ascending neurons, BMNs, and motor neurons—following FlyWire annotations (Dorkenwald et al., 2024). Interneurons were defined as neurons whose soma and all neurites were confined to the brain. Descending neurons were defined as neurons whose somata are located in the CNS and whose neurites extend into the descending tracts toward the ventral nerve cord (VNC). Conversely, ascending neurons were identified as neurons whose neurites enter the brain through the cervical connective and whose somata lie outside the FAFB imaged volume, resulting in only their neurites being visible in the dataset.

      (6) In the section titled "BMN synaptic partners in the CNS: ascending, descending, and interneurons", the balance of explanation is skewed toward presynaptic input to BMNs. It would strengthen clarity if you expanded equally on the postsynaptic side (i.e., BMN outputs) or explicitly signposted why the focus here is on inputs. That way, readers won't be left wondering whether outputs are less important or just deferred to later figures.

      We have revised the section that was previously skewed toward presynaptic BMNs. This section also addresses some confusion about interpreting Figure 3, from a critique from Reviewer 2. The section now reads: “Postsynaptic connections were predominantly interneurons (56%), with significant contributions from descending (28%) and ascending (16%) neurons (Figure 5D, F, H, J). Interneurons are more numerous as distinct partner neurons, whereas descending neurons receive a larger fraction of BMN output synapses across BMN types (Figure 3A, B). Thus, descending neurons are fewer in number but tend to receive more BMN synapses per neuron on average, while interneurons are more numerous but often receive fewer synapses per neuron. Together, these partner categories underscore the strong integration of BMNs with local brain circuitry (interneurons), and with pathways linking the brain and ventral nerve cord (VNC), through ascending neurons that provide VNC-derived synaptic input and descending neurons that carry BMN output toward the VNC.”

      (7) The network diagrams in Figure 9 convey clustering, but a complementary heatmap of BMN type × partner connectivity could highlight the parallel organization more clearly. This would make the block-like separation of dorsal, ventral, and posterior subnetworks more immediately apparent, reinforcing the conclusion of parallel somatotopy-based processing. This section would also benefit from drawing the functional message more explicitly: that BMNs form largely independent, somatotopically aligned pathways with regional overlap, supporting the idea of parallel grooming circuits. Right now, the text reads as a connectivity catalog, and the key concept of parallel regional architecture risks being underemphasized.

      We agree that connectivity matrices provide a clear and accessible representation of these data. We have moved the cosine-similarity connectivity matrices previously shown in Figure 9 – figure supplement 1 into the main manuscript, where they now appear as Figure 9. These matrices depict similarity among BMN postsynaptic partners. For this reason, we have retained the original network-graph representation (now Figure 10) to display the full pre- and postsynaptic connectome structure.
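As a rough illustration of how a cosine-similarity connectivity matrix of this kind can be built, the sketch below compares synapse-count vectors pairwise. It is not the authors' pipeline: the neuron names, partner columns, and counts are hypothetical, and which vectors the authors compared (BMN types versus partners) is not specified in this response.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two synapse-count vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def similarity_matrix(counts):
    """Return sorted row names and the pairwise cosine-similarity matrix."""
    names = sorted(counts)
    matrix = [[cosine_similarity(counts[a], counts[b]) for b in names] for a in names]
    return names, matrix

# Hypothetical synapse counts: each row is one neuron type, each column one partner neuron.
toy_counts = {
    "dorsal_A": [10, 8, 0, 0],
    "dorsal_B": [9, 7, 1, 0],
    "ventral_A": [0, 0, 6, 12],
}

names, matrix = similarity_matrix(toy_counts)
for name, row in zip(names, matrix):
    print(name, [round(value, 2) for value in row])
```

In this toy case the two dorsal rows share partners and score close to 1, while the ventral row shares none with `dorsal_A` and scores 0, which is the block-like pattern a somatotopic organization would be expected to produce in such a matrix.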

      Based on the Reviewer’s suggestion to clearly state the key concepts of the parallel architecture, we added a sentence to the end of the Results section entitled: Somatotopy-based connectivity among BMN synaptic partners in the CNS. That text reads: “Thus, the BMNs form largely independent, somatotopically aligned pathways with regional overlap, supporting the idea of parallel grooming circuits.”

      (8) It would help if the authors explained the somatotopy logic more explicitly (that reciprocal inhibition preserves local head regions, ensuring that suppression and gain control act locally). At present, the narrative is buried in network-graph detail - a heatmap or simple region-level summary would make this organizational principle much clearer to readers.

      We thank the Reviewer for this suggestion. To make the somatotopy logic of pre/post feedback inhibition clearer and less buried in network-graph detail, we revised the text in this Results section to more explicitly distinguish (i) reciprocal, head-region–localized inhibitory loops that could support local gain control from (ii) non-reciprocal cross-type inhibitory pathways that could contribute to heterotypic suppression between head regions. In addition, we modified the figure to more clearly convey somatotopy by adding text on the plot and updating the legend to state: “Bold text indicates the general head location of BMNs on the plot, revealing somatotopy-based connectivity with pre/post neurons (i.e. ventral, dorsal, posterior, and the ventral/dorsal transition).”

      (9) Please adjust the section title, "LB23 hemilineage member neurons elicit aimed head grooming movements" to avoid implying new functional experiments. For example:

      (a) "LB23 neurons include previously defined antennal grooming command neurons" or

      (b) "LB23 hemilineage anatomically corresponds to grooming-related neurons".

      This would make it clear that the contribution here is anatomical linkage, not fresh functional data.

      We changed the section title to the Reviewer-suggested title b: LB23 hemilineage anatomically corresponds to grooming-related neurons

      (10) The current network graphs in Figure 13B are not very intuitive - it is hard to visually extract the somatotopy. A connectivity heatmap or matrix (BMN types on one axis, LB23 neurons or subgroups on the other, with synapse strength as colour) would make the block-like, region-specific mapping immediately clear. A coarse-grained version (e.g., dorsal/ventral/posterior BMNs vs LB23 subgroups) could further highlight the parallel, somatotopically organized pathways. This would better support the central claim of Figure 13 than the current spring-layout graphs. Figure 13F does this for BMN inputs onto aBN2 neurons. (But it is presented only in binary form; could the authors not add a graded colour scale proportional to synapse number?)

      The binary form was necessary because the results are from different sources (i.e. CATMAID versus FlyWire synapse counts) with different synapse numbers.

      We modified Figure 13B to more clearly convey somatotopy by adding text on the plot and updating the legend to state: “Bold text indicates the general head location of BMNs on the plot, revealing somatotopy-based connectivity with LB23 neurons (i.e. ventral, dorsal, and posterior head).” We hope that this modification satisfies the Reviewer.

    1. eLife Assessment

      This important study combines optogenetic manipulations with wide-field cortical imaging to investigate the neural basis of context-dependent sensory processing. It provides compelling evidence that the retrosplenial cortex modulates behavioral responses to whisker deflection depending on the behavioral context. The paper will be of strong interest to neuroscientists studying cortical mechanisms of sensorimotor processing.

    2. Reviewer #1 (Public review):

      Summary

      The strength of this manuscript lies in the behavior: mice use a continuous auditory background (pink vs brown noise) to set a rule for interpreting an identical single-whisker deflection (lick in W+ and withhold in W− contexts) while always licking to a brief 10 kHz tone. Behaviorally, animals acquire the rule and switch rapidly at block transitions and take a few trials to fully integrate the context cue. What's nice about this behavior is the separate auditory cue, which shows the animals remain engaged in the task, so it's not just that the mice check out (i.e., become disengaged in the W- context). The authors then use optical tools, combining cortex-wide optogenetic inactivation (using localized inhibition in a grid-like fashion) with widefield calcium imaging to map what regions are necessary for the task and what the local and global dynamics are. Classic whisker sensorimotor nodes (wS1/wS2/wM/ALM) behave as expected with silencing reducing whisker-evoked licking. Retrosplenial cortex (RSC) emerges as a somewhat unexpected, context-specific node: silencing RSC (and tjS1) increases licking selectively in W−, arguing that these regions contribute to applying the "don't lick" policy in that context. I say somewhat because work from the Delamater group points to this possibility, albeit in a Pavlovian conditioning task and without neural data.

      The widefield imaging shows that RSC is the earliest dorsal cortical area to show W+ vs W− divergence after the whisker stimulus, preceding whisker motor cortex, consistent with RSC injecting context into the sensorimotor flow. A "Context Off" control (continuous white noise; same block structure) impairs context discrimination, indicating the continuous background is actually used to set the rule (an important addition!) Pre-stimulus functional-connectivity analyses suggest that there is some activity correlation that maps to the context presumably due to the continuous background auditory context. Simultaneous opto+imaging projects perturbations into a low-dimensional subspace that separates lick vs no-lick trajectories in an interpretable way.

      In my view, this is a clear, rigorous systems-level study that identifies an important role for RSC in context-dependent sensorimotor transformation, thereby expanding RSC's involvement beyond navigation/memory into active sensing and action selection. The behavioral paradigm is thoughtfully designed, the claims related to the imaging are well defended, and the causal mapping is strong.

      Comments on revisions:

      The authors have been responsive to the prior review and I think the manuscript is a valuable and important addition to the literature.

    3. Reviewer #2 (Public review):

      Summary:

      The authors aim to understand the neural basis of context-dependent sensory processing and decision-making.

      Strengths:

      They used an innovative behavioral paradigm where the action-outcome association changes independent of the sensory stimulus. This allowed the authors to disentangle the effect of behavioral context on sensory processing in RSC. Using this approach combined with optogenetic silencing, they discover that RSC activity is necessary for suppressing a lick response when the stimulus switches to the unrewarded context. The authors provide compelling evidence that the RSC is an important node of context-dependent sensory processing.

      Weaknesses:

      Sensory processing appears to be entangled with jaw/tongue movement initiation. Nonetheless, it is clear that RSC and motor cortex convey contextual signals with a very short latency.

      Comments on revisions:

      Thank you for updating the manuscript. Good work.

    4. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary

      The strength of this manuscript lies in the behavior: mice use a continuous auditory background (pink vs brown noise) to set a rule for interpreting an identical single-whisker deflection (lick in W+ and withhold in W− contexts) while always licking to a brief 10 kHz tone. Behaviorally, animals acquire the rule and switch rapidly at block transitions and take a few trials to fully integrate the context cue. What's nice about this behavior is the separate auditory cue, which shows the animals remain engaged in the task, so it's not just that the mice check out (i.e., become disengaged in the W- context). The authors then use optical tools, combining cortex-wide optogenetic inactivation (using localized inhibition in a grid-like fashion) with widefield calcium imaging to map what regions are necessary for the task and what the local and global dynamics are. Classic whisker sensorimotor nodes (wS1/wS2/wM/ALM) behave as expected with silencing reducing whisker-evoked licking. Retrosplenial cortex (RSC) emerges as a somewhat unexpected, context-specific node: silencing RSC (and tjS1) increases licking selectively in W−, arguing that these regions contribute to applying the "don't lick" policy in that context. I say somewhat because work from the Delamater group points to this possibility, albeit in a Pavlovian conditioning task and without neural data. I would still recommend the authors of the current manuscript review that work to see whether there is a relevant framework or concept (Castiello, Zhang, Delamater, 'The retrosplenial cortex as a possible 'sensory integration' area: a neural network modeling approach of the differential outcomes effect of negative patterning', 2021, Neurobiology of Learning and Memory).

      The widefield imaging shows that RSC is the earliest dorsal cortical area to show W+ vs W− divergence after the whisker stimulus, preceding whisker motor cortex, consistent with RSC injecting context into the sensorimotor flow. A "Context Off" control (continuous white noise; same block structure) impairs context discrimination, indicating the continuous background is actually used to set the rule (an important addition!) Pre-stimulus functional-connectivity analyses suggest that there is some activity correlation that maps to the context presumably due to the continuous background auditory context. Simultaneous opto+imaging projects perturbations into a low-dimensional subspace that separates lick vs no-lick trajectories in an interpretable way.

      In my view, this is a clear, rigorous systems-level study that identifies an important role for RSC in context-dependent sensorimotor transformation, thereby expanding RSC's involvement beyond navigation/memory into active sensing and action selection. The behavioral paradigm is thoughtfully designed, the claims related to the imaging are well defended, and the causal mapping is strong. I have a few suggestions for clarity that may require a bit of data analysis. I also outline one key limitation that should be discussed, but is likely beyond the scope of this manuscript.

      Major strengths

      (1) The task is a major strength. It asks the animal to generate differential motor output to the same sensory stimulus, does so in a block-based manner, and the Context-Off condition convincingly shows that the continuous contextual cue is necessary. The auditory tone control ensures this is more than a 'motivational' context but is decision-related. In fact, the slightly higher bias to lick on the catch trials in the W+ context is further evidence for this.

      (2) The dorsal-cortex optogenetic grid avoids a 'look-where-we-expect' approach and lets RSC fall out as a key node. The authors then follow this up with pharmacology and latency analyses to rule out simple motor confounds. Overall, this is rigorous and thoughtfully done.

      (3) While the mesoscale imaging doesn't allow for cellular resolution, it allows for mapping of the flow of information. It places RSC early in the context-specific divergence after whisker onset, a valuable piece that complements prior work.

      (4) The baseline (pre-stim) functional connectivity and the opto-perturbation projections into a task subspace increase the significance of the work by moving beyond local correlates.

      Key limitation

      The current optogenetic window begins ~10 ms before the sensory cue and extends 1 s after, which is ideal for perturbing within-trial dynamics but cannot isolate whether RSC is required to maintain the context-specific rule during the baseline. Because context is continuously available, it makes me wonder whether RSC is the locus maintaining or, instead, gating the context signal. The paper's results are fully consistent with that possibility, but causality in the pre-stimulus window remains an open question. (As a pointer for future work, pre-stimulus-only inactivation, silencing around block switches, or context-omission probe trials (e.g., removing the background noise unexpectedly within a W+ or W- context block) could help separate 'holding' from 'gating' of the rule. I'm not suggesting these are needed for this manuscript, but they would be interesting for future studies.)

      We thank the reviewer for the comprehensive summary of our work.

      We also thank the reviewer for highlighting the work from the Delamater group (Castiello et al., 2021), and we now briefly discuss this paper on P. 14 Lines 434-437 writing: “RSC was shown to contribute to negative patterning in behavioral tasks requiring rats to learn that the simultaneous presentation of two stimuli lead to an opposite outcome than each individual stimulus (Castiello et al., 2021).”

      We also agree with the reviewer’s noted ‘Key limitation’ regarding the role of RSC as either maintaining context representation or serving a gating function. The reviewer proposes an exciting set of further experiments inactivating RSC at different time points to investigate when RSC activity is needed. We hope to carry out such experiments in the future. We now include a brief discussion of this interesting point on P. 14-15 Lines 455-459 writing: “First, further inactivation experiments would shed light on the timing at which RSC activity is necessary for the integration of contextual information. Specifically, it would be of great interest to inactivate RSC at different time points such as during the intertrial interval or at the transition between contexts.”

      We have of course also addressed each of the more detailed comments from the “Recommendations for the authors” section, please see below.

      Reviewer #2 (Public review):

      Summary:

      The authors aim to understand the neural basis of context-dependent sensory processing and decision-making.

      Strengths:

      They used an innovative behavioral paradigm where the action-outcome association changes independent of the sensory stimulus. This theoretically allows the authors to disentangle the effect of behavioral context on sensory processing. Using this approach combined with optogenetic silencing, they discover that RSC activity is necessary for suppressing a lick response when the stimulus switches to the unrewarded context.

      Weaknesses:

      Sensory processing appears to be entangled with jaw/tongue movement initiation. Activity in M1 and RSC during auditory-evoked lick responses appears to be identical to activity during whisker-evoked lick responses, indicating that movement initiation is the main driver of M1/RSC activity, rather than changes in the flow of sensory information. If sensory information were the main driver of the initial M1/RSC response, then auditory evoked responses should have a longer latency. Perhaps this is beyond the resolution of the calcium indicator or imaging frame rate. It is not clear from the data shown if differences in S1 activity when comparing W+ and W- stimulation are caused by context-sensitive sensory processing or whisker movement following whisker deflection.

      We thank the reviewer for the comments on our work and we agree that separating sensory processing and movement initiation is very important. In the revised manuscript, we have carried out several new analyses to specifically address the points of the reviewer. The most important point is that context-dependent activity in RSC emerges at ~50 ms after the whisker stimulus, which precedes any differences in movements of the jaw or whisker. Although sensory and motor representations become increasingly entangled after stimulus delivery, we think that the first ~100 ms after the whisker stimulus is a relatively safe period for analysing sensory processing and decision making before overt context-dependent differences in movements.

      Addressing the specific point “Activity in M1 and RSC during auditory-evoked lick responses appears to be identical to activity during whisker-evoked lick responses, indicating that movement initiation is the main driver of M1/RSC activity, rather than changes in the flow of sensory information.” - We have now directly compared the pattern of cortical activity evoked by whisker and auditory stimuli in correct trials in the W+ context (new Figure 3 – figure supplement 2). As expected, activity in wS1/wS2 and A1 is stronger in whisker and auditory trials respectively, following their sensory modalities. However, we also find a stronger response of wM1/wM2 in whisker trials as early as 40 to 60 ms following the stimulus, showing the specificity to the whisker system. We also observe a stronger response of RSC to whisker than to auditory stimulus. The auditory and whisker evoked responses are therefore different.

      Addressing the specific point “If sensory information were the main driver of the initial M1/RSC response, then auditory evoked responses should have a longer latency. Perhaps this is beyond the resolution of the calcium indicator or imaging frame rate.” – As stated above, the responses to auditory and whisker stimuli are different.

      Addressing the specific point “It is not clear from the data shown if differences in S1 activity when comparing W+ and W- stimulation are caused by context-sensitive sensory processing or whisker movement following whisker deflection.” - We think that the data shown in Figure 3F-H indicate that differences in S1 activity when comparing W+ and W- stimulation are not directly caused by context-sensitive sensory processing. On P. 9 Lines 270-273 we write: “Early after stimulus onset, whisker deflection evoked similar activation of primary and secondary whisker somatosensory cortices (wS1 and wS2) in both W+ and W− contexts.” Indeed, context separation in wS1/wS2 only emerged later than 100 ms, when it is confounded by the difference in movement evoked by the sensory stimulus (now quantified in new Figure 3 – figure supplement 4). On the contrary, RSC and wM1/2 responses to the whisker stimulus were different in W+ and W- at early time points (~50 ms for RSC and ~80 ms for wM1/2), which is consistent with context-dependent sensory processing. At least two hypotheses could explain the absence of early difference in whisker evoked activity in wS1/wS2 between W+ and W-. The first one is that sensory activity in wS1/wS2 is not modulated by contextual information at all, while the alternative option would imply that sensory activity is mediated by different neuronal populations depending on context with an overall similar average response. We think this is an interesting question which we hope to address in future experiments using Neuropixels recordings and multiphoton cellular imaging to examine the single-neuron representation of whisker stimulus in wS1/wS2 according to context in the task presented here.

      We have of course also addressed each of the more detailed comments from the “Recommendations for the authors” section, please see below.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      Suggestions to strengthen the manuscript (no new data collection)

      (1) The block-switch dynamics were clearly demonstrated behaviorally. It would be very powerful to mirror this with an analysis of neural data around the block-switch: how do the various areas adjust immediately after a shift in the continuous contextual sound? Does the RSC show any evidence of changing activity patterns? How does the within-trial activity dynamic look as a function of the number of trials from the context switch? This could be done with the data collected for Figure 3 (for within-trial dynamics), but also for the pre-stimulus baseline activity data (Figure 4A-B).

      We thank the reviewer for raising this interesting point. We have now investigated the change of cortical activity at the transition between contexts (new Figure 3 – figure supplement 5). At the context transition, both to W+ and to W- contexts, we observed a rapid activation of the auditory cortex (new Figure 3 – figure supplement 5A). In addition, there appeared to be a slightly higher activation of RSC when transitioning to W- rather than to W+ (new Figure 3 – figure supplement 5A). In the future, it will be of great interest to further investigate this phenomenon.

      We also evaluated the whisker deflection-evoked responses of the different cortical regions according to the number of whisker trials from context switch (new Figure 3 – figure supplement 5B&C). This analysis revealed that while the sensory responses in wS1 and wS2 were constant over the time course of a context block, the response of wM1/2 and especially RSC became progressively lower in the W- context, consistent with the behavioral results in Figure 1 supporting time-dependent contextual integration.

      Overall, these results strengthen the role of RSC and wM1/2 in integrating contextual information to guide the response to the whisker stimulus, and we thank the reviewer for raising this important point.

      (2) It might be useful to state 'earliest among the imaged dorsal cortical areas,' and briefly acknowledge potential subcortical contributors (since those were not explored and could be earlier than cortical areas).

      We agree with the reviewer. In the Summary, on P. 2 Lines 39-40 we now write: “Widefield calcium imaging revealed that retrosplenial cortex was the first dorsal cortical area to show context discrimination in response to whisker stimulation”. On P. 8 Lines 257-258, we now write: “To investigate the spatiotemporal neural dynamics underlying task execution, we recorded calcium activity across the dorsal cortex in transgenic mice”. On P. 13 Lines 416-420 we now write: “Functional imaging of cortical activity with two different genetically-encoded calcium indicators each showed similar spatiotemporal dynamics of whisker sensory processing with the earliest context-dependent divergence in signalling being detected in RSC, out of the imaged dorsal cortical areas (Figure 3).” On P. 15 Lines 470-473, we now write: “Finally, it is of course important to note that many subcortical regions (as well as non-dorsal cortical regions, which were not imaged) are likely to contribute importantly to context-dependent task performance.”

      (3) Fit a simple exponential/logistic to lick probability vs time-since-switch (your Figure 1H-style analysis) to report a time constant with CIs; it will help quantify the integration of the continuous cue.

      We thank the reviewer for this suggestion. We have fitted an exponential to the grand average data to quantify the time constants for integration of contextual information before the presentation of the first whisker stimulus of the block (see new Figure 1H). On P. 6 Lines 170-173 we now write: “To assess whether this temporal integration would differ between contexts we fitted an exponential to the time evolution of the lick probability. This suggested a faster transition to the W+ context than to the W- context (W+ time constant: 9.4 s, W- time constant: 15.5 s) (Figure 1H).”
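
      As an illustration of the kind of fit described above, the following is a minimal sketch, assuming a single-exponential model of the form P(t) = a + b * exp(-t / tau). It is not the authors' analysis code: the function name `fit_exponential`, the grid of candidate time constants, and the synthetic lick probabilities are all hypothetical. For each candidate tau, the offset and amplitude are solved in closed form by least squares, and the tau with the smallest squared error is kept.

```python
import math

def fit_exponential(times, probs, tau_grid):
    """Fit probs ~ a + b * exp(-t / tau).

    For each candidate tau the model is linear in (a, b), so those two
    parameters are obtained by ordinary least squares; the tau giving the
    smallest sum of squared errors is returned with its (a, b).
    """
    n = len(times)
    mean_y = sum(probs) / n
    best = None
    for tau in tau_grid:
        x = [math.exp(-t / tau) for t in times]
        mean_x = sum(x) / n
        sxx = sum((xi - mean_x) ** 2 for xi in x)
        sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, probs))
        b = sxy / sxx
        a = mean_y - b * mean_x
        sse = sum((a + b * xi - yi) ** 2 for xi, yi in zip(x, probs))
        if best is None or sse < best[0]:
            best = (sse, tau, a, b)
    _, tau, a, b = best
    return tau, a, b

# Synthetic, noiseless lick probabilities (hypothetical values, not real data).
times = [float(t) for t in range(0, 60, 2)]          # seconds since context switch
true_tau = 9.4                                       # used only to generate the toy data
probs = [0.8 - 0.6 * math.exp(-t / true_tau) for t in times]

tau_grid = [0.5 + 0.1 * i for i in range(400)]       # candidate time constants, 0.5 to 40.4 s
tau_hat, a_hat, b_hat = fit_exponential(times, probs, tau_grid)
```

      On real data, confidence intervals of the sort the reviewer requested could be obtained by refitting on bootstrap resamples of mice or sessions; that step is not shown here.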

      (4) Because catch-trial false alarms are higher in W+ than W−, report per-context d′ and criterion for whisker trials (using signal detection theory); this separates sensitivity from bias and makes the behavioral shift more interpretable. It is also further proof that the behavior is contextual (versus a compound stimulus, for example).

      We have computed the d’ and criterion for the whisker trials in the W- and W+ contexts (see new Figure 1 - figure supplementary 1D). As suggested by the reviewer, this further supports that the behavior is driven by contextual information.
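
      For readers unfamiliar with the signal detection measures mentioned here, the following is a minimal sketch of the standard equal-variance Gaussian formulas, d’ = z(hit rate) - z(false-alarm rate) and criterion = -0.5 * (z(hit rate) + z(false-alarm rate)). The function name and the trial counts are hypothetical, and the log-linear correction used to avoid rates of exactly 0 or 1 is an assumption, not necessarily what the authors used. Here "hits" would be licks on whisker trials and "false alarms" licks on catch trials within one context.

```python
from statistics import NormalDist

def dprime_and_criterion(hits, misses, false_alarms, correct_rejections):
    """Return (d', criterion) under the equal-variance Gaussian model."""
    # Log-linear correction (an assumption) keeps both rates strictly inside (0, 1).
    hit_rate = (hits + 0.5) / (hits + misses + 1.0)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1.0)
    z = NormalDist().inv_cdf                      # inverse of the standard normal CDF
    d_prime = z(hit_rate) - z(fa_rate)            # sensitivity
    criterion = -0.5 * (z(hit_rate) + z(fa_rate)) # bias; negative values mean a liberal bias
    return d_prime, criterion

# Hypothetical trial counts for two contexts (not real data).
d_a, c_a = dprime_and_criterion(hits=80, misses=20, false_alarms=20, correct_rejections=80)
d_b, c_b = dprime_and_criterion(hits=80, misses=20, false_alarms=40, correct_rejections=60)
```

      In this toy example the second set of counts has the same hit rate but more false alarms, so it yields a lower d’ and a more negative (more liberal) criterion, which is the separation of sensitivity from bias that the reviewer asked for.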

      (5) For the pre-stimulus seed-correlation analysis, can you regress out the pupil/jaw/whisker activity to confirm whether the context modulation is (or is not) movement-driven? It would be helpful to better understand whether the baseline correlation is driven by differences in low-level factors between the contexts, versus the higher-level decision rule/context.

      The reviewer raises an interesting point. However, we did not find a straightforward way to regress out movements, and thus we leave this point for future in-depth analysis. On P. 11 Lines 354-357 we now write: “It is important to note that these context-dependent changes in resting-state functional connectivity could relate to the overt context-dependent movements in the prestimulus baseline (Figure 1I&J) and/or a manifestation of higher-level internal rule representations.”

      (6) For the earliest divergence analysis, is this consistent across animals and across sessions within animals? Can you show per-mouse distributions of first-crossing times (d′>2) for RSC vs wM1/2/wS2? This would help provide confidence in this key finding.

      The d’ presented in Figure 3H is computed as the discriminability between contexts at the population level, meaning that at each timepoint (from Figure 3F) we compared the 2 distributions built on N=6 mice. As such, if the divergence between contexts was not consistent across animals, this d’ would be low. That said, as suggested by the reviewer, we further investigated this context divergence at single mouse level and single session level. Our analysis supporting the main finding (Figure 3F-H) is shown in new Figure 3 – figure supplement 3.

      First, we show the results for a single mouse across sessions in Figure 3 – figure supplement 3A. We show the stimulus aligned activity in correct whisker trials in both contexts for the 3 recording sessions. For each session we quantified the main effect size defined as the difference of the trial average between contexts. Plotting the difference of mean response, we consistently observed that RSC ramps-up before wM1/2 for the 3 sessions.

      Second, across all individual mice: we further aggregated the session average responses to show discriminability between contexts for each region at the single mouse level (Figure 3 – figure supplement 3B). We show that RSC is the first region to exhibit context separation in 4 out of the 6 mice that we recorded. In the other 2 mice, all regions seemed to show context separation but without clear temporal ordering.

      Finally, when averaging across mice, we observed a clear separation and first discrimination in RSC (Figure 3F-H and Figure 3 – figure supplement 3C).

      Overall, these further analyses suggest that the early divergence of RSC activity appears to be robust with a consistent mean difference in single sessions and single mice, as well as across the population of mice. We think this analysis has strengthened our manuscript and we thank the reviewer for the valuable suggestion.
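
      The population-level discriminability and first-crossing time discussed in this response can be sketched as follows. This is a minimal illustration, assuming d’ is the difference of the across-mouse means divided by the pooled standard deviation; the exact formula the authors used is not stated here. The function names and the per-mouse values are hypothetical, and the threshold of 2 is the one quoted in the reviewer's question.

```python
from statistics import mean, variance

def discriminability(samples_a, samples_b):
    """d' between two sets of per-mouse values: mean difference over pooled SD."""
    pooled_sd = ((variance(samples_a) + variance(samples_b)) / 2.0) ** 0.5
    return (mean(samples_a) - mean(samples_b)) / pooled_sd

def first_crossing(times_ms, dprime_trace, threshold=2.0):
    """Return the first time at which |d'| exceeds the threshold, or None."""
    for t, d in zip(times_ms, dprime_trace):
        if abs(d) > threshold:
            return t
    return None

# Synthetic responses: one value per mouse (N=6) at three time points (not real data).
times_ms = [0, 50, 100]
w_plus = [[1, 2, 3, 1, 2, 3], [2, 3, 4, 2, 3, 4], [5, 6, 7, 5, 6, 7]]
w_minus = [[1, 2, 3, 1, 2, 3], [1, 2, 3, 1, 2, 3], [1, 2, 3, 1, 2, 3]]

trace = [discriminability(a, b) for a, b in zip(w_plus, w_minus)]
crossing = first_crossing(times_ms, trace)
```

      Because the denominator is the across-mouse spread, inconsistent divergence across animals inflates the pooled standard deviation and keeps d’ low, which is the point made in the response. Applying the same two functions to each mouse's trials separately would give per-mouse first-crossing times; note that the sketch does not guard against zero variance.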

      (7) For the opto mapping data, could you provide P(lick) effect sizes with CIs per grid site? It would also be nice to summarize the qualitative dichotomy: RSC/tjS1 increases licking in W−; canonical wS1/wS2/wM/ALM decreases licking across contexts (to my understanding).

      We now provide the P(lick) effect sizes for the main cortical areas studied in the paper in Figure 2 – figure supplement 1C. This shows the relative change in lick probability in optogenetic trials compared to control trials for each mouse.

      Reviewer #2 (Recommendations for the authors):

      (1) Do mice move their whiskers after stimulus onset? If so, are these movements dependent on behavioral context? What causes the increase in S1 activity during auditory-evoked response trials?

      To answer the reviewer’s questions we have further investigated whisker movements following the sensory stimuli (whisker and auditory correct trials) in both contexts. The results of this analysis are presented in new Figure 3 – figure supplement 4.

      We find that mice move their whiskers shortly after the whisker stimulus in both contexts. The time course of whisker angle in correct whisker trials is similar in both contexts with a discriminability index (d’) consistently below 1. The whisker speed in response to stimulus is slightly higher in the W+ context compared to W- with a d’ slightly above 1 after ~100 ms. We also observed evoked whisker movements in auditory trials independent of context. Thus, whisker movements are indeed evoked by the sensory stimuli, but the overall context-dependent modulation of whisker movements is weak. The early differences in whisker-evoked cortical activity in W+ compared to W- contexts are therefore more likely related to the integration of contextual information than to differences in evoked movements.

      The reviewer is correct to point out that wS1 activity increases in auditory trials (Figure 3E). The response is initially very weak, but becomes more prominent after ~100 ms following the auditory tone. We do not know the underlying mechanisms, but there are several likely explanations. First, as discussed above, there are indeed some whisker movements evoked in response to the auditory stimulus (Figure 3 – figure supplement 4), which could result in sensory input to wS1. Equally, the increase could relate to licking, given the broad representation of movements in cortex and an appropriate reaction time in auditory trials (Figure 3C). Alternatively, wS1 activity in auditory trials could also be related to input connectivity from auditory cortex, top-down input from frontal cortex or subcortical regions such as high-order POm.

      (2) What do the authors think is causing the W+ vs W- difference in S1/S2 activity approximately 100ms after whisker deflection?

      The late W+ vs W- difference in wS1/wS2 activity could be explained by several factors. First this could be due to the difference in whisker movements after ~100 ms as shown in Figure 3 – figure supplement 4. Second this could be driven by the lick vs no lick activity (see reaction time in Figure 3C for whisker trials ~110 ms). Finally, this could be partly due to some movement independent top-down contextual information reaching wS1/wS2 at late time points. Overall, our claim in the paper is that there was no contextual difference in whisker primary and secondary cortices at early time points (before movement). On P. 9 Lines 270-273 we explicitly write: “Early after stimulus onset, whisker deflection evoked similar activation of primary and secondary whisker somatosensory cortices (wS1 and wS2) in both W+ and W− contexts.” In contrast, our main findings are grounded in the divergence of cortical activity in RSC and wM1/2 at early time points (<100 ms).

      (3) The choice of PC3 seems arbitrary. Is there no task-relevant information in PC1 and PC2?

      We appreciate the point raised by the reviewer and have clarified the reasoning leading to PC3 selection in the main text, where on P. 12-13 Lines 384-391 we now write: “The loadings of the first principal components were uniformly distributed and could reflect a late movement driven activation distributed across all cortical areas (Figure 4 – figure supplement 2C&D). PC2 loadings show variation along the anteroposterior axis that could reflect differences between sensory and motor regions but its time course does not separate between lick and no lick in control conditions (Figure 4 – figure supplement 2C&D). The loadings of PC3 highlighted task-related cortical regions and its time course exhibited clear differences comparing lick and no-lick trials.” In addition, we now also show the time courses for PC1 and PC2 in Figure 4 – figure supplementary 2D.

      Overall, the reasoning is the following:

      PC1 has spatially-homogeneous positive loadings (Figure 4 – figure supplementary 2C) and activity along PC1 gradually ramps up following sensory stimulation (Figure 4 – figure supplementary 2D). It is likely driven by widespread activation of the cortex following the whisker stimulus and the lick response. As such we believe that the task-related information captured by PC1 is movement related and not necessarily informative about processing of whisker and context.

      PC2 has loadings varying along the antero-posterior axis (Figure 4 – figure supplementary 2C), which could be relevant for the task, but its time course does not discriminate between lick and no lick in either W+ or W- (Figure 4 – figure supplementary 2D).

      PC3 has both loadings that vary between several cortical regions involved in the task (Figure 4 – figure supplementary 2C) and a time course that separates between lick and no lick in both contexts (Figure 4 – figure supplementary 2D). We thus focus on PC3 to investigate the effect of optogenetic inactivation on whisker stimulus evoked activity.

      The remaining components beyond PC3 contain a very small fraction of variance and were thus not considered.

      (4) Figure 3 - Supplement 1: What explains the change in fluorescence in GFP/tdT mice during W+ stimulation? Is it brain movement on the z-dimension? Could this explain differences in calcium imaging results?

      We thank the reviewer for this question. The nature of intrinsic signals is a complex topic, but brain movement is unlikely to contribute importantly, because under similar behavioral conditions we (and others) typically find brain movements to be on the scale of a few microns. The three most widely-reported contributions to intrinsic optical changes in cortex relate to:

      (i) Light scattering – as neurons integrate synaptic inputs and fire action potentials, the neuronal elements swell slightly due to the ionic and water fluxes (see for example Vincis et al. Cell Reports 2015, doi: 10.1016/j.celrep.2015.06.016). This reduces the refractive index mismatch between the intracellular and extracellular space. This in turn reduces light scattering, which could result in fluorescence increases.

      (ii) Hemodynamics – changes in blood volume and changes in oxygenation/deoxygenation will change the absorption of light at different wavelengths, in an activity-dependent manner (also forming the basis of BOLD fMRI signals).

      (iii) Flavoproteins – endogenous fluorescent proteins, such as flavoproteins present at high levels in mitochondria, have been reported to change their fluorescence depending upon neuronal activity, presumably in relationship to increased mitochondrial activity.

      We therefore think it is very important to image GFP/tdTomato-expressing mice as controls, and we would suggest that this should be carried out more commonly in the field. Indeed, similar to our results, another study (Yogesh et al., eLife 2025, doi: 10.7554/eLife.104914) recently reported upon the importance of carefully examining intrinsic fluorescence changes, which were found to be present in both wide-field and two-photon imaging of GFP expressing mice.

      Our results reported in Figure 3 – figure supplement 1 show that GFP/tdTomato signals over the first ~120 ms following whisker stimulation were much smaller than the equivalent changes in GCaMP6f/jRGECO1a-expressing mice, and therefore would only have a minor contribution to our analyses. However, we refrained from analysing fluorescence changes at later post-stimulus times, because the intrinsic signals indeed become increasingly prominent as the mice initiate licking.

    1. eLife Assessment

      This is an important contribution that confirms prior evidence that word recognition - a cornerstone of development - improves across early childhood and is related to vocabulary growth. This study is distinguished by its use of a large, multi-study dataset that is uncommon in prior research on cognitive development. It provides compelling evidence that speed, accuracy, and consistency of word learning improve with age, and will therefore prove of interest to those studying language, and more broadly, perception and development.

    2. Reviewer #1 (Public review):

      Summary:

      The study examined the extent to which children's word recognition skill improves across early development, becoming faster, more accurate and less variable, and the extent to which word recognition skill is related to children's concurrent and later vocabulary knowledge.

      The main strength of the study comes from the dataset which recycles previously collected data from 24 studies to examine the development of word recognition skill using data from 1963 children. This maximizes the impact of previously collected data while also allowing the study to reliably ask big picture questions on the development of word recognition skill and its relation to chronological age and vocabulary knowledge. Data analysis is rigorous, thought through and very clearly described. Data and code necessary to reproduce the manuscript are shared on the project's Github. The limitations of the study are acknowledged and the manuscript does well to tone down the causal implications of their results.

    3. Reviewer #2 (Public review):

      Summary:

      This paper presents a series of analyses of a large dataset combining many prior studies of early word recognition (Peekbank). The analyses demonstrate that the speed, accuracy and consistency of word learning improves with age. Moreover, the speed of word learning early in development was related to vocabulary growth over time.

      Strengths:

      A key strength of the paper is the use of a large multi-study dataset. This is particularly valuable in the field of early cognitive development, which has (due to practical limitations) often been based on small-scale studies that necessarily provide a shaky foundation for conclusions. The analyses are also well-motivated.

      Weaknesses:

      In an earlier version of the manuscript, the meaning of "word recognition ability" was ambiguous and could have referred to either (A) an intrinsic ability that matures, or (B) knowledge of the common, concrete words typically used in these studies that increases with experience. The revised version of the manuscript identifies these two interpretations and acknowledges that they cannot be teased apart in the current work.

    4. Author response:

      The following is the authors’ response to the original reviews

      General note

      We have issued a new release of the general Peekbank database, 2026.1, which includes more data integrity checks and several more datasets. As a result of this release, the underlying dataset we use in our paper has shifted slightly. The shifts represent a relatively small proportion of the total data and thus these changes have caused only relatively minor changes to our numerical results. We also highlight that we now include a small amount of data regarding children younger than 12 months, increasing the developmental range of our analysis (see Figure 1).

      Reviewer 1 (Public review):

      The limitations of the study are acknowledged to some extent, but this needs to be improved and to run throughout the manuscript. Thus, in the discussion, the authors note that the approach is observational and exploratory, and highlight for me a key alternative explanation of the findings, namely that faster children could be faster due to their larger vocabulary, rather than faster children learning more words. Indeed, the latter explanation for the relationship is called into question, given that growth in speed was not related to growth in vocabulary. Here, the authors note that the null result may be related to the fact that they do not have sufficiently precise estimates of growth slopes, rather than taking the alternative explanation seriously that there may not be as causal a link between being a faster word learner and a better word learner (learn more words).

      Thank you very much for your challenging and thoughtful comments. In hindsight we did not realize that the way we were writing about our results was ambiguous between several interpretations (one of which we endorse and one of which we do not).

      We respond below to the specific suggestions about causal directionality in the longitudinal analysis, but we certainly believe that we cannot draw strong conclusions about causality from our dataset and have attempted throughout the paper to remove causal language that might have crept into our interpretation.

      In response to your comments, we have made a number of key revisions aimed at qualifying and clarifying our points:

      • The abstract now prominently notes that our design is observational: “In an observational study…”

      • The abstract notes a positive and a negative result in the relationship between word recognition and vocabulary: “Further, across a range of longitudinal models, speed, accuracy, and vocabulary were coupled. Children with overall faster word recognition tended to show faster vocabulary growth, though developmental growth in word recognition skill was not specifically associated with growth in vocabulary.”

      • The abstract removes potential causal language in the final sentence: “... these findings support the view that word recognition is a skill that develops gradually across early childhood and that this skill is deeply intertwined with early language learning.”

      • A new paragraph in the Results introduces the potential hypotheses investigated via the longitudinal models.

      • The final paragraph of the Results section sharpens the contrast between two possible growth hypotheses: “However, we did not find evidence for the stronger version of this claim: in neither the non-linear growth model nor the linear SEM did we find evidence that increases in speed were related to increases in vocabulary size. Thus, our findings do not support a ‘virtuous cycle’ model in which increases in recognition specifically lead to increases in vocabulary size.”

      We hope these changes lead to a manuscript that better aligns with the limitations of the study.

      This is especially since, but correct me if I’m wrong here, the current vocabulary size is not taken into consideration in the model examining vocabulary growth. Given the increasing number of studies showing that current vocabulary knowledge predicts vocabulary growth (Laing, Kalinowski et al, Siew & Vitevitch), one simple alternative explanation is that current vocabulary knowledge predicts both current word recognition skill and later vocabulary knowledge. Is there anything in the data speaking against this hypothesis?

      We think the reviewer’s overall point is generally correct, as we described above, but we want to clarify a specific statistical point. The non-linear longitudinal model of vocabulary growth does in fact take into account a child’s average vocabulary size. (This point feels tricky in a non-linear model but it’s actually quite similar to a linear model for the purposes of this discussion). Basically, vocabulary (at all timepoints) is modeled as a function of age, with both main effects and interactions with age. Critically, each participant is also modeled as having a random intercept capturing their deviation from the average growth pattern across ages (as expressed by the fixed effects). In this model, the “main effect” (here captured by the intercept for the logistic curve in the model) that we observe for speed indicates that vocabulary growth for individuals is predicted to be faster (their curve is shifted left) if their RTs are fast. The presence of the random effects in this model thus “controls” for the fact that some participants have overall higher vocabularies (and are shifted up relative to the average growth curve).

      But, we note that this model does not show an “interaction effect” (here captured by the null effect of RT on the slope parameter in the logistic model). That’s one of the null effects that we now call out much more prominently in the abstract and end of the results (per our response above).
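
      The distinction drawn above between an effect on the intercept and an effect on the slope of the logistic curve can be made concrete with a small sketch. This is not the authors' fitted model: the function names, the parameterization, and every coefficient value below are made up for illustration. The linear predictor combines a population intercept, a per-child random intercept, and an RT effect on the intercept, plus a slope on age that may itself depend on RT.

```python
import math

def predicted_vocab(age_months, rt_z, child_intercept=0.0,
                    intercept=-5.0, slope=0.25,
                    rt_on_intercept=-0.5, rt_on_slope=0.0):
    """Proportion of words known under a logistic growth curve (hypothetical values).

    rt_z            : standardized reaction time (negative = faster child)
    child_intercept : per-child random intercept (deviation from the average curve)
    rt_on_intercept : "main effect" of RT; shifts the whole curve earlier or later
    rt_on_slope     : "interaction" of RT with age; changes the growth rate
    """
    eta = (intercept + child_intercept + rt_on_intercept * rt_z) \
        + (slope + rt_on_slope * rt_z) * age_months
    return 1.0 / (1.0 + math.exp(-eta))

def half_asymptote_age(rt_z, child_intercept=0.0, intercept=-5.0, slope=0.25,
                       rt_on_intercept=-0.5, rt_on_slope=0.0):
    """Age at which the curve reaches half of its asymptote (eta = 0)."""
    return -(intercept + child_intercept + rt_on_intercept * rt_z) / (slope + rt_on_slope * rt_z)

fast = predicted_vocab(20, rt_z=-1.0)     # faster-than-average child
average = predicted_vocab(20, rt_z=0.0)
slow = predicted_vocab(20, rt_z=1.0)
```

      With `rt_on_slope` left at zero, as in the null result described above, a faster child's curve has the same shape as the average curve but is shifted to younger ages; a non-zero `rt_on_slope` would instead make the curves steeper or shallower.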

      Equally, while the SEM examines vocabulary growth controlling for age, I wonder about the other way around. What would happen to the effect of age on word recognition skill (in the LME model, S8) if one were to add concurrent vocabulary size? So does chronological age explain word recognition skill or vocabulary knowledge? Right now, the manuscript describes this effect purely related to chronological age, but is it age per se or other cognitive abilities, including a key change across development, namely, vocabulary size? Thus, the presentation of the skill learning hypothesis suggests that age is a proxy for experience, while you actually have here a very nice proxy for experience in terms of children’s vocabulary size.

      Again, thank you for engaging with this tricky set of issues. Overall, our goal is to adjust the manuscript to reflect points of agreement; in particular, we agree that age is a proxy for language experience, vocabulary, and other cognitive changes, and we have stated this explicitly now in the intro to the factor analyses: “In our prior analyses, chronological age acts as a proxy for greater language experience and larger vocabulary as well as a host of other correlated developmental changes in cognition. Now we explicitly explore relations to vocabulary growth and the triadic relationship between age, word recognition, and vocabulary.”

      On the statistical side, we do think that the NLME (non-linear mixed effects; the logistic growth model) effectively controls for average vocabulary size, as described above. The longitudinal SEM also relates vocabulary growth to growth in word recognition skill. In both models, we find no evidence for coupled growth; instead the evidence points to children with higher baseline word recognition skill showing faster growth in vocabulary (speed intercept significantly related to vocabulary slope, -.14, p < .01) but not the reverse (vocabulary intercept not strongly related to speed slope; -.01, ns).

      More generally, we hope our edits to the paper, detailed above, both clarify this tricky set of issues and also remove inappropriate causal language throughout.

      Critically, while the discussion is more nuanced, the way the abstract is concluded and the way the Introduction is phrased suggest that the study is able to answer a causal question, which, as the authors themselves note, is not possible. The abstract, for instance, states that word recognition becomes faster, more accurate and less variable...consistent with a process of skill learning. And also that this skill plays a role in supporting early language learning, which is very causal language. I don’t think you can really claim that you are testing the two hypotheses you suggest here. The work is definitely embedded in the context of these hypotheses, but are you really able to test them? My worry is that, while the discussion is more nuanced, this study will then be cited down the line as showing that children learn more words because they are faster at recognizing words; anything that you can do to temper such interpretations would be good for the literature. For me, this should not just be relegated to the discussion but should be touched upon in the abstract and Introduction.

      Thanks for pushing us to be more precise with how we frame and describe our findings. We agree with the reviewer that our findings do not warrant strong conclusions about the causal role of word recognition skill in vocabulary growth. Per our response above, we have now tried to carefully revise our language throughout the paper (in particular, in the abstract and introduction, as noted by the reviewer).

      Finally, it would help to talk more about the mechanisms at work in any relationship between word recognition and language learning. It seems to me that this would rely on some predictive processing framework, given the description on page 4, and it would be good to make this clear (faster and more accurately you can recognize a ball, better use this evidence to infer the speaker’s intended meaning).

      Thanks, this is a great point. We’ve revised this text and added references to predictive processing, unpacking a problematic paragraph into two:

      “Familiar word recognition -- as measured by LWL -- is hypothesized to play a key role in language learning (19). The idea, in a nutshell, is that the faster and more accurately a child can process incoming words, the more opportunities they have for learning. Consider a child hearing the utterance "Can you put the ball in the crate?" The better the child can recognize the word "ball", the better they can use this evidence to help infer the speaker's intended meaning, allowing possible inferences about the meaning of the less familiar word, "crate" (20).

      “Real time language processing, including word recognition, relies heavily on predictive processing, in which comprehenders integrate expectations from prior linguistic context with noisy and ephemeral incoming signals (21, 22). The more input a child receives, the better their predictions are likely to be, and hence the more they can learn (19, 23). Indeed, measurements of children's language input at home are consistently associated with their vocabulary size (24, 25). And, in line with this predictive processing framework, one important study found that children's word recognition speed mediated the longitudinal relationship between home language input and vocabulary growth (26). Thus, word recognition is thought to be a key support for ongoing word learning.”

      Equally, when referring to word recognition, it would be good to clarify what this refers to - how well a child knows what a word refers to (and in the context of LWL, what it does not refer to) or how quickly it directs attention to what is referred to.

      Thanks, we’ve added a capsule definition in the second paragraph, and added the sentence “This procedure [LWL] measures the general construct of word recognition by operationalizing knowledge of a meaning as visual attention to a specific named referent.” We hope this clarifies the relationship between LWL and word recognition.

      With regards to the data, I wonder if there is a clustering of kids past 24 months that is happening here, looking at Figures 1 and 2, where it seems like there is less change past the 24-month point. Is there any way to look at whether the effect of age or vocabulary on word recognition is not linear but asymptotic?

      Thanks for pointing this out; we do see what you are talking about but think it’s being handled appropriately in the analysis. In Figure 1 it clearly looks like changes to RT are asymptotic – this is why we analyze the logarithm of RT throughout the paper. In Supplement S6 we show that reaction time is indeed best fit by a log-log function. Your question about Figure 2 asks whether there is further structure beyond the log-log fit; in Supplement S7 we show some analyses that suggest a polynomial fit is not better than the log-log fit; there is some small additional linear effect of age over and above the log-log fit, but it’s minor and pretty hard to interpret in our view.
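
      To illustrate what a log-log fit means here, the following is a minimal sketch, assuming reaction time follows a power law in age, RT = c * age^b, so that log(RT) is linear in log(age). The function name and the synthetic ages and reaction times are hypothetical; the actual models are those reported in Supplements S6 and S7.

```python
import math

def fit_log_log(ages, rts):
    """Ordinary least squares of log(RT) on log(age); returns (slope, intercept)."""
    xs = [math.log(a) for a in ages]
    ys = [math.log(r) for r in rts]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Synthetic power-law data (hypothetical values, not real data).
ages = [12, 18, 24, 36, 48]                      # months
rts = [3000.0 * a ** -0.5 for a in ages]         # milliseconds
slope, intercept = fit_log_log(ages, rts)
```

      A negative slope on the log-log scale produces exactly the flattening in raw RT at older ages that the reviewer noticed: each doubling of age multiplies RT by the same factor, so the absolute change shrinks as RT gets smaller.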

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      Page 3. Word production may manifest in overt behaviour but need not reflect complete knowledge. A child can say the word dog and use it to refer to a cat.

      This is a good point. Since we are not able to speak to the precision of meaning representations (an important issue in its own right), we have omitted the phrase “with incomplete knowledge.”

      Page 4. The first two sentences of the paragraph beginning with word recognition ability... don’t go together. The second sentence does not support the claim that word recognition plays a role in language learning.

      Thanks, we’ve tried to smooth out this transition as part of unpacking the role of predictive processes.

      Page 4. “predicts children’s standardized test scores years later” - make clear what test scores are here.

      We added some additional details. The specific tests were the CELF (expressive language) and the KABC (IQ), but we thought too much detail might be distracting.

      Page 5. I love Table 1, but would like for the data to be weighted somehow. So, given that some studies had a lot more trials and more children, what percentage of the data did this study contribute? That allows a clearer view of how biased the sample is in certain studies. The x in CDIS and longitudinal could be aligned to the right. I kept wondering why there was an x near some trials.

      Thanks, we’ve adjusted the table to add the percentage of the total dataset (in trials) due to each study and fixed the alignment issue.

      Page 6. 12 million individual samples: what samples are these? Individual data points per trial per time point. Making this clear would be great.

      Clarified, thanks.

      Page 9. Your accuracy measures only seem to consider the target. From what I remember of my preferential looking days, this measure usually also includes the distractor. Why do you not do this? This is especially since you have such a wide age range, so if a 12-month-old only looks for about 50 per cent of the trial and spends that time looking at the target, that is very different from a child who looks at the screen all of the trial and spends less time looking at the target here.

      Sorry for any lack of clarity: we do in fact compute accuracy as the ratio of looking to target over looking to target plus looking to distractor. We have added this information to the parenthetical referenced above: “... accuracy (more target looking; computed as the ratio of target to target plus distractor looking)”.
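
      The accuracy definition quoted above can be sketched as follows. This is a minimal illustration with hypothetical sample labels ("target", "distractor", or None for samples where the child is not looking at either image); it does not reproduce the Peekbank data format or the authors' code.

```python
def trial_accuracy(looks):
    """Proportion of target looking among samples on either image.

    looks: sequence of "target", "distractor", or None (off-screen or missing).
    Returns None when the child never looked at either image.
    """
    target = sum(1 for sample in looks if sample == "target")
    distractor = sum(1 for sample in looks if sample == "distractor")
    if target + distractor == 0:
        return None
    return target / (target + distractor)

# Child who looks at the screen for half the trial, always at the target.
half_trial = trial_accuracy(["target"] * 5 + [None] * 5)
# Child who looks for the whole trial, mostly at the target.
full_trial = trial_accuracy(["target"] * 6 + ["distractor"] * 4)
```

      Because samples away from both images are excluded from the denominator, the first child in this toy example scores 1.0 despite looking for only half the trial, while the second scores 0.6, which addresses the scenario raised by the reviewer.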

      Page 12. I only found out that age was in this model by looking at S9.

      Thanks for mentioning this omission, we’ve clarified in the text: “We initially add age as an additional variable to our models to explore whether this factor structure relates to age; later we treat age as a predictor of latent factors.”

      Page 12. Isn’t it trivial that speed and accuracy show negative covariance, especially given how you measure accuracy? Thus, if I take longer to fixate the target, I have less time to look at the target during the trial. If, however, I included the distractor in my accuracy measure, then I could still take longer to look at the target, but still look more at the target than the distractor.

      Thanks for mentioning that this covariance is not the key result of interest; that observation didn’t come out in the text. Now we note that this covariation is “... as expected since they [speed and accuracy] are derived from the same data.” Note per above that accuracy is computed as target / target + distractor looking; even so, your observation is correct: slower looking at the target means lower accuracy at least to some degree.

      Page 19. If you excluded data from trials with less than 50% of timepoints, how did this vary across age? Arguably, your study has to worry less about this, given your sample size, but it would be nice to know, which you could include in the percentage of data that each study contributed to the final sample.

      Thanks, we’ve added this information to a new table in S1.

      Reviewer #2 (Public review):

      First, I wasn’t entirely clear about what the authors meant by “word recognition ability”. For much of the manuscript (including the use of the term “word recognition ability” itself), this comes across as an intrinsic ability or skill that improves with development. Alternatively, the speed and accuracy metrics taken from studies in Peekbank might capture children’s increasing knowledge of the common, concrete words typically used in these studies. To me, this is a somewhat different construct from a general skill at recognizing words. It would be helpful if the authors could clarify which construct they intend to capture, or if it is not possible to distinguish between these constructs from the Peekbank data.

      In response to this comment and related comments above, we’ve added text to the first two paragraphs trying to clarify the general construct that we’re talking about – recognizing the meaning of a word in real-time language comprehension. We’ve also clarified several times throughout the introduction that we’re talking about familiar word recognition, that is, the ability to recognize specific known words. Further, we directly acknowledge the issue above in the introduction:

      “Critically, most word recognition paradigms use words that children at the target age are reported to understand and produce. They are thus not indices of vocabulary size but rather measures of how quickly and accurately the child can recognize a familiar spoken word and use it to guide their visual attention to a referent. However, it is unknown the extent to which specific responses reflect an individual child's general speed of language processing versus their familiarity of specific words.”

      Second, and relatedly, if the source of the age-related improvements is increasing experience with the common concrete words used in the Peekbank studies, then one might expect word recognition and improvements with age to be related to word frequency, given that more frequent words are experienced more often. Word frequency predicts word knowledge when assessed using CDI data. Can effects of frequency be detected in Peekbank word recognition metrics? If not, why? Similarly, is the speed and accuracy of word recognition in Peekbank data related to CDI-derived word age of acquisition, and again, if not, why?

      This is a fascinating set of ideas, and one that we’ve pursued extensively using the Peekbank data. Unfortunately, we think it is out of scope for the current paper, which focuses on child-level metrics (including vocabulary and processing measures). Right now the current paper doesn’t include any analysis of individual words.

      Just to expand a bit on the problem here: unfortunately, modeling word recognition as a simple linear function of (log) word frequency is only possible in the case that distractors are held constant (e.g., “ball” always has “book” as its distractor), because distractor frequency plays an important role in the recognition process. However, in our dataset, words are paired with many different distractors across studies. This property means a fairly complex model of the LWL decision process would be necessary for a model to successfully predict effects for individual words. While such a model is an exciting research goal, it’s not something we can include in the current manuscript.

      Finally, there is a bit of a risk of the main findings of this paper coming across as a foregone conclusion. I.e., how could it be otherwise that word recognition improves with development?

      Reviewer #2 (Recommendations for the authors):

      Regarding the feedback about the risk of the findings coming across as a foregone conclusion - perhaps a primary place in the paper where it would be useful to clarify this point is on page 6, in the paragraph beginning, “We investigate two specific hypotheses here. First, one influential theory...”. Here, it might be worth clarifying whether there are alternative ideas about the emergence of word recognition in childhood that predict different patterns, so that the findings of the current paper can be framed as shedding new light on word recognition in development, rather than a confirmation of the common-sense idea that word recognition must improve over development.

      Thanks, we appreciate this feedback and it’s something we’ve struggled with in this project. Our conclusion is that this paper does not constitute a binary hypothesis test of e.g., whether word recognition is linked to vocabulary development. Instead, we lean into the idea that there are empirical issues (rather than hypotheses) that have not been quantified sufficiently. Thus, we end the revised introduction with the following paragraph:

      “Across both of these issues, the contribution of our work here lies in the detailed quantitative description of development. Nearly every theory of language learning assumes some role for continuous developmental change in word recognition, but these assumptions have not previously been anchored to specific measurements. Hence neither the functional form of the assumed changes nor their concurrent and predictive relationships to vocabulary have been quantified. We leverage the Peekbank dataset to accomplish these goals.”

    1. eLife Assessment

      This important study is the first characterization of the phenotype caused by a lack of Eml3 expression in mice. Mutant animals present a disrupted pial basement membrane, leading to focal extrusions from the cerebral cortex, called ectopias. The methodology is convincing and the conclusions are solid, although further investigations on the molecular and cellular mechanisms are required to improve the manuscript. This work would be of interest to neural development biologists and human geneticists working on brain disorders.

    2. Reviewer #2 (Public review):

      Summary:

      In this manuscript, the authors investigate the role of the microtubule-binding protein EML3 during cortical development through the generation and characterization of an Eml3 mouse mutant. The authors focus mainly on the effects of EML3 loss on brain development, although Eml3 mouse mutants also present with developmental delay and growth restriction, and die perinatally due to respiratory distress caused by delayed maturation of the lungs. The main finding in the developing cortex is the presence of focal neuronal ectopias, which contain neurons from all cortical layers, as revealed by immunostaining. The authors use electron microscopy to show that ectopias seem to be caused by disruption to the pial basement membrane at early stages of development, which allows neurons to breach through it. To find a functional link between EML3 and the observed phenotype, studies are conducted that demonstrate expression of EML3 in radial glia cells and mesenchymal cells, both cell types involved in the formation and maintenance of the pial basement membrane. Furthermore, interaction partners for EML3 are identified through coIP-MS analysis, including tubulin beta-3, 14-3-3 proteins and cytoplasmic dynein light chain. However, mice carrying a mutant EML3 allele engineered to abolish the interaction between EML3 and cytoplasmic dynein light chain do not recapitulate any of the symptoms of complete EML3 loss.

      Strengths:

      The manuscript offers several important strengths that contribute significantly to the field. This study presents the first characterization of Eml3 knockout animals, providing novel insights into the role of Eml3 in vivo. Information on Eml3 function so far was restricted to cell culture data, so the results in this manuscript start to fill an important gap in our knowledge about this microtubule-binding protein. The experimental approach is carefully designed, with appropriate controls that ensure the reliability of the data. Moreover, the authors have addressed a key challenge in the analysis, namely the developmental delay of the knockout animals. By implementing a strategy to match developmental stages between wild-type and knockout groups, they allow for meaningful and valid comparisons between the two genotypes. Importantly, the authors have successfully generated three different Eml3 mutant mouse lines (knockout, floxed and with disrupted binding to cytoplasmic dynein light chain), which are very valuable tools for the broader scientific community to further study the roles of this gene in development and disease in the future.

      Weaknesses:

      While the manuscript presents valuable data, there are also several weaknesses that limit the overall impact of the study. Most notably, there is no clear mechanistic link established between the loss of Eml3 function and the observed phenotype, leaving the biological significance of the findings somewhat speculative, as it is not straightforward how a microtubule-associated protein can have an impact on the stability of the pial basement membrane. In this respect, but also in general for the whole manuscript, there seems to be a considerable amount of experimental work that has been conducted but is not presented, possibly due to the negative nature of the results. Additionally, the phenotype reported appears to be dependent on the genetic background, as it is absent in the CD1 strain. This observation raises concerns as to how robust the results are and how much they can be generalized to other mouse strains, but, more importantly, to humans.

    3. Reviewer #3 (Public review):

      Summary:

      This work aims to understand the role of Echinoderm Microtubule-associated Protein-like 3 (EML3) in embryogenesis and neocortical development. Importantly, this work shows that depletion of EML3 causes focal neuronal ectopias by disrupting the structural integrity of the pial basement membrane, describing a new model of cobblestone brain malformation. Another member of the EML family, EML1, has already been shown to trigger neuronal migration disorders, particularly subcortical band heterotopia, by affecting cell polarity. The results presented here point to a different mechanism of action. The authors show that EML3 is expressed in radial glia cells and mesenchymal cells in the pial region and that, upon EML3 depletion (i.e., in Eml3 mutant mice), the pial basement membrane is structurally damaged, allowing migrating neuroblasts to migrate ectopically through it. This indicates that, in this case, weakening of the pial basement membrane is a prerequisite for focal neuronal ectopias. The authors provide a meticulous characterization of the Eml3 mutant mice, strengthening the conclusions of the results.

      Strengths:

      The authors provide a very detailed analysis of the defects observed in Eml3 mutant mice, by providing not only results by inferred day of conception but by classifying embryos by their number of somite pairs.

      Weaknesses:

      Most of the weaknesses originally raised by the reviewer have been addressed.

    4. Author response:

      The following is the authors’ response to the original reviews

      The following revisions have been made to address most of the publicly available suggestions made by the Reviewers.

      We have also corrected formatting issues in two figure panels:

      Fig. 1B: embryo ages added over placenta images.

      Fig. 4D: fixed a truncated label.

      Reviewer #1 (Public review):

      The study would benefit from clearer evidence and additional experiments that would help to establish the molecular and cellular mechanisms underlying the brain phenotype, the central topic of the work.

      We agree that additional experiments are necessary to elucidate the mechanism(s) by which EML3 deficiency causes the observed developmental phenotypes. However, as no further experimentation is possible due to the closure of our laboratory, we are committed to sharing available materials including custom antibodies and cryopreserved sperm from our mouse lines. We include previously generated experimental data not presented in the original submission. While these additional data do not reveal the mechanisms, we believe that sharing hypotheses that were experimentally ruled out will benefit the scientific community.

      M&M: we have added a section listing several tissue-specific Eml3 KOs generated. All of the generated cKO mice were indistinguishable from Eml3<sup>wt</sup> controls.

      Supp. Fig. 2 with staining for major PBM components has been added. We have added antibody information to M&M.

      Reviewer #2 (Public review):

      (1) While the manuscript presents valuable data, there are also several weaknesses that limit the overall impact of the study. Most notably, there is no clear mechanistic link established between the loss of Eml3 function and the observed phenotype, leaving the biological significance of the findings somewhat speculative, as it is not straightforward how a microtubule-associated protein can have an impact on the stability of the pial basement membrane. In this respect, but also in general for the whole manuscript, there seems to be a considerable amount of experimental work that has been conducted but is not presented, possibly due to the negative nature of the results. At least some of those results could be shown, particularly (but not only) the stainings for the composition of the ECM components.

      We agree that additional experiments are necessary to elucidate the mechanisms at play. While we cannot conduct further experiments, we provide additional existing data, including a new Supp. Fig. 2 showing ECM component staining. As this reviewer rightly anticipated, these results might not clarify the mechanism but sharing the hypotheses that were already experimentally tested will be helpful.

      (2) Additionally, the phenotype reported appears to be dependent on the genetic background, as it is absent in the CD1 strain. This observation raises concerns as to how robust the results are and how much they can be generalized to other mouse strains, but, more importantly, to humans.

      Indeed, we have determined that genetic background greatly influences the manifestation of developmental defects caused by absence or mutation of the EML3 protein in mice. Modifier genes appear to play a significant role in phenotypic expression. In humans, the presence or absence of such modifiers may result in a broad spectrum of outcomes from no clinical relevance, as seen in CD1 mice, to potential intrauterine mortality. We agree that this underscores the challenge of translating mouse model findings to human implications. Future studies could include a search for EML3 non-coding regulatory mutations and expanded analysis of neuronal development defects, such as COB, as well as cases of intrauterine growth restriction (IUGR).

      (3) There is no data included in the manuscript about the generation and analysis of the Eml3AAA/AAA mouse line. This is an important omission, especially as no details on the validation or phenotypic characterization of this additional mouse line are provided. Including these elements would greatly strengthen the rigor and interpretability of the work, especially if that mouse line is to be shared with the scientific community.

      We acknowledge this oversight and have added a Materials and Methods section describing the generation of Eml3 TQT86AAA mice. Validation of the Eml3 TQT86AAA mice included showing absence of EML3-DYNLL binding in our co-IP MS data in Table 3. We state that the validated Eml3 TQT86AAA mice were phenotypically indistinguishable from Eml3<sup>wt</sup> control mice.

      Reviewer #3 (Public review):

      (1) Besides the data provided in the figures, the authors report a significant amount of experiments/results as "Data not shown". Negative data is still important data to report, and the authors may want to choose some crucial "not shown data" to report in the manuscript.

      We have incorporated key datasets previously omitted, with priority given to those specifically requested by Reviewer #2.

      (2) Results in Figure 3A apparently contradict results in 3B. A better explanation of the results should improve understanding of the data. Even though the conclusion that the "onset and progression of neurogenesis is normal in Eml3 null mice" seems logical based on the data, the final numbers are not (Figure 3A) and this should be acknowledged, as well.

      We provide further explanations for the data presented in Figures 3A and 3B to better convey that the two datasets are not contradictory. In essence, since Eml3 null mice are developmentally delayed (as determined by the number of somites at a specific age, Fig. 1C), the milestones in neurogenesis are reached at a later age in Eml3 null mice. Thus, at embryonic age E11.5, Eml3 null mice have fewer TBR2-positive cells (Fig. 3A). However, Eml3 null mice have reached the same neurogenesis milestones as their WT counterparts when they have the same number of somites (Fig. 3B).

      Results section for Fig. 3: we provide additional explanations that reconcile the results shown in Fig. 3A and Fig. 3B.

      (3) The authors should define which cell types are identified by SOX1 and PAX6.

      We have defined the expression timing and cell identity marked by SOX1 and PAX6 in neural progenitors during cortical development.

    1. eLife Assessment

      In this important study, Li et al. identify estrogen receptor 1-expressing neurons (ESR1+) in Barrington's nucleus as key regulators coordinating both bladder contraction and the relaxation of the external urethral sphincter. Using appropriate and validated methodologies aligned with the current state of the art, the data are convincing and of generally high quality.

    2. Reviewer #1 (Public review):

      [Editor's note: this version has been assessed by the Reviewing Editor with further input from the original reviewers. The authors have addressed the comments raised in the previous round of review.]

      Summary:

      Urination requires precise coordination between the bladder and external urethral sphincter (EUS), while the neural substrates controlling this coordination remain poorly understood. In this study, Li et al. identify estrogen receptor 1-expressing neurons (ESR1+) in Barrington's nucleus as key regulators that faithfully initiate or suspend urination. Results from peripheral nerve lesions suggest that BarEsr1 neurons play independent roles in controlling bladder contraction and relaxation of the EUS. Finally, the authors performed region-specific retrograde tracing, claiming that distinct populations of BarEsr1 neurons target specific spinal nuclei involved in regulating the bladder and EUS, respectively.

      Strength:

      Overall, the work is done with high quality. The authors integrate several cutting-edge technologies and sophisticated, thorough analyses, including opto-tagged single unit recordings, combined optogenetics and urodynamics, particularly those following distinct peripheral nerve lesions.

      Comments on revised version:

      During the revision, the authors have adequately addressed my concerns and made the suggested changes accordingly. I have no additional comments.

    3. Reviewer #2 (Public review):

      Summary:

      The authors have performed a rigorous study to assess the role of ESR1+ neurons in the PMC to control coordination of bladder and sphincter muscles during urination. This is an extension of previous work defining the role of these brainstem neurons, and convincingly adds to the understanding of their role as master regulators of urination. This is a thorough, well-done study that clarifies how the Pontine micturition center coordinates different muscle groups for efficient urination, but there are some questions and considerations that remain.

      Strengths:

      These data are thorough and convincing in showing that ESR1+ PMC neurons exert coordinated control over both the bladder and sphincter activity, which is essential for efficient urination. The anatomical distinctions in pelvic versus pudendal control are clear, and it's an advance to understand how this coordination occurs. This work offers a clearer picture of how micturition is driven.

      Weaknesses:

      The dynamics of how this population of ESR1+ neurons is engaged in natural urination events remains unclear. Not all ESR1+ neurons are always engaged, and it is not measured whether this is simply variation in population activity, or if more neurons are engaged during more intense starting bladder pressures, for instance. In particular, the response dynamics of single and doubly-projecting neurons are not defined. Additionally, the model for how these neurons coordinate with CRH+ neuron activity in the PMC is not addressed, although these cell types seem to be engaged at the same time. Lastly, it would be interesting to know how sensory input can likely modulate the activity of these neurons, but this is perhaps a future direction.

    4. Reviewer #3 (Public review):

      Summary:

      The paper by Li et al explored the role of Estrogen receptor 1 (Esr1) expressing neurons in the pontine micturition center (PMC), a brainstem region also known as Barrington's nucleus (Hou et al 2016, Keller et al 2018). First, the authors conducted bulk Ca2+ imaging/unit recording from PMCESR1 to investigate the correlations of PMCESR1 neural activity to voiding behavior in conscious mice and bladder pressure/external urethral muscle activity in urethane anesthetized mice. Next, the authors conducted optogenetic inactivation/activation of PMCESR1 to confirm the contribution to the voiding behavior, and also conducted peripheral nerve transection together with optogenetic activation to confirm the independent control of bladder pressure and urethral sphincter muscle.

      Comments on revised version:

      No concerns. All my major questions were addressed.

    5. Author response:

      The following is the authors’ response to the original reviews.

      We would like to express our deep appreciation to the editor and reviewers for their constructive comments and suggestions, which have significantly improved the quality of our manuscript. In response, we have carefully revised the manuscript, addressed all comments, and performed additional experiments and analyses to strengthen our findings.

      (1) We repeated retrograde tracing using CTB-647 to verify precise targeting of SPN and DGC neurons, as shown in the new Figure 7.

      (2) We performed dual retrograde tracing combined with fiber photometry or optogenetic activation to investigate the role of PMC dual-projecting neurons in the control of urination, as shown in Figure supplements 11 and 12.

      (3) We conducted new experiments activating PMC<sup>ESR1+</sup> neurons after PDNx to assess their role in urination, as shown in new Figure 6.

      (4) We added a more detailed analysis of the dynamics of neural responses in PMC<sup>ESR1+</sup> neurons in Figure supplements 3F-3G.

      (5) We analyzed peak Ca<sup>2+</sup> signals in the PMC during and after the onset of EMG bursting, as shown in Figure supplement 4F.

      (6) We added a comparison of spontaneous and light-induced spikes in PMC<sup>ESR1+</sup> neurons, as shown in Figure supplements 3B–3C.

      (7) We expanded the Discussion to address how PMC<sup>ESR1+</sup> neurons coordinate bladder contraction and sphincter relaxation to control both the initiation and suspension of urination.

      We hope these revisions meet the reviewers' expectations and contribute to the improvement of our manuscript.

      Reviewer #1 (Public review):

      Summary:

      Urination requires precise coordination between the bladder and external urethral sphincter (EUS), while the neural substrates controlling this coordination remain poorly understood. In this study, Li et al. identify estrogen receptor 1-expressing neurons (ESR1+) in Barrington's nucleus as key regulators that faithfully initiate or suspend urination. Results from peripheral nerve lesions suggest that BarEsr1 neurons play independent roles in controlling bladder contraction and relaxation of the EUS. Finally, the authors performed region-specific retrograde tracing, claiming that distinct populations of BarEsr1 neurons target specific spinal nuclei involved in regulating the bladder and EUS, respectively.

      Strengths:

      Overall, the work is of high quality. The authors integrate several cutting-edge technologies and sophisticated, thorough analyses, including opto-tagged single unit recordings, combined optogenetics, and urodynamics, particularly those following distinct peripheral nerve lesions.

      We are grateful for your insightful and constructive comments, which affirmed the importance and technical depth of our work. Thank you for dedicating your expertise and time to reviewing our manuscript. Guided by your suggestions, we have revised the paper as detailed below.

      Weaknesses:

      (1) My major concern is the novelty of this study. Keller et al. 2018 have shown that BarEsr1 neurons are active during urination and play an essential role in relaxing the external urethral sphincter (EUS). Minimally, substantial content that merely confirms previous findings (e.g. Figures 1A-E; Figures 3A-E) should be moved to the supplementary datasets.

      Thank you for this valuable and constructive comment. We fully agree that the novelty of our study relative to Keller et al., 2018 must be made explicit. Keller et al. established that PMC<sup>ESR1+</sup> neurons are active during socially evoked urine-marking behavior (voluntary urination) and demonstrated their essential role in relaxing the EUS. Their study mainly focused on behavioral context and EUS relaxation. In contrast, our work addresses a distinct, mechanistic question: how these same neurons participate in reflexive, physiological urination and coordinate both bladder detrusor contraction and EUS relaxation.

      Novel aspects of the present study:

      (1) Temporal dynamics of PMC<sup>ESR1+</sup> neurons during reflexive micturition.

      Using opto-tagging and single-unit recordings, we reveal the precise firing pattern of PMC<sup>ESR1+</sup> neurons during reflexive voiding. Simultaneous fiber photometry, cystometry, and EUS-EMG recordings demonstrate that population-level activity of PMC<sup>ESR1+</sup> neurons precedes and tightly correlates with both bladder contraction and EUS relaxation, a coordination not previously demonstrated.

      (2) Causal role in reflexive urination.

      Manual closed-loop optogenetic inhibition at the onset of reflexive voiding acutely terminates EUS bursting and bladder contraction, immediately halting urine release.

      (3) Dual control of bladder and EUS.

      Optogenetic activation combined with selective pelvic or pudendal nerve transection shows that PMC<sup>ESR1+</sup> neurons drive both bladder contraction and EUS relaxation, revealing a coordinating role beyond EUS relaxation alone.

      (4) Anatomical substrate for coordinated control of bladder contraction and EUS relaxation in reflexive urination.

      Retrograde tracing identifies three spinal-projecting sub-populations: SPN-only, DGC-only, and dual-targeting neurons, providing a circuit-level explanation for the simultaneous control of bladder and EUS.

      Following your suggestion, panels that merely replicate Keller et al. (former Figures 1A–1E and Figures 3A–3E) have been moved to new Figure Supplements 1 and 7, respectively, so that the main figures now emphasize the new mechanistic findings.

      (2) I also have concerns regarding the results showing that the inactivation of BarEsr1 neurons led to the cessation of EUS muscle firing (Figures 2G and S5C). As shown in the cartoon illustration of Figure 8, spinal projections of BarEsr1 neurons contact interneurons (presumably inhibitory) that innervate motor neurons, which in turn excite the EUS. I would therefore expect that the inactivation of BarEsr1 should shift the EUS firing pattern from phasic (as relaxation) to tonic (removal of relaxation), rather than stopping their firing entirely. Could the authors comment on this and provide potential reasons or mechanisms for this finding?

      Thank you for this crucial comment. We apologize that the representative EUS-EMG traces in Figures 2G and S5C were too small to be clearly seen and that the corresponding results description was not sufficiently accurate. We have now replaced these EMG traces with enlarged versions (revised Figures 2G and S5C) and revised the corresponding Results section (lines 184, 197, 340-341). Based on the enlarged traces, we found that acute photoinhibition of PMC<sup>ESR1+</sup> neurons at the onset of phasic EUS-EMG bursting shifted the EUS firing pattern from large-amplitude phasic bursts to low-amplitude tonic firing. This suggests that ongoing activity of PMC<sup>ESR1+</sup> neurons is required to maintain phasic EUS bursting. A similar shift from phasic to tonic EUS-EMG activity during optogenetic silencing of PMC<sup>ESR1+</sup> neurons was reported by Keller et al., 2018 (Figure supplement 8C), confirming the reproducibility of the phenotype. We propose that this low-amplitude tonic activity may be mediated in part by a spinal reflex pathway (the guarding reflex) for preventing urination, whereby the loss of PMC<sup>ESR1+</sup> neuron-mediated supraspinal facilitation reduces inhibition of spinal interneurons, leading to enhanced baseline excitability of EUS motor neurons in response to bladder afferent input during bladder distension (William C. de Groat et al., Comprehensive Physiology. 2015, PMID: 25589273).

      (3) Current evidence is insufficient to support the claim that the majority of BarEsr1 neurons innervate the SPN but not DGC. The current spinal images are uninformative, as the fluorescence reflects the distribution of Esr1- or Crh-expressing neurons in the spinal cord, along with descending BarEsr1 or BarCrh axons. Given the close anatomical proximity of these two nuclei, a more thorough histological analysis is required to demonstrate that the spinal injections were accurately confined to either the SPN or the DGC.

      Thank you for raising this important concern. To rigorously verify that our spinal injections were confined to either the SPN or the DGC, we performed new retrograde-tracing experiments in ESR1-Cre and CRH-Cre mice. We injected a mixture of AAV-Retro-DIO-mCherry or AAV-Retro-DIO-EGFP with the retrograde tracer CTB-647 specifically into the SPN or DGC (Methods, lines 465-466). Only animals in which CTB-647 fluorescence was strictly limited to the target nucleus, without detectable spread to the adjacent region, were included in the analysis (new Figures 7A and 7E). These results confirm our original observation that PMC<sup>ESR1+</sup> neurons comprise three distinct spinal-projection subpopulations: one (19.0%) targeting the SPN, one (52.2%) innervating the DGC, and a third (28.8%) projecting to both regions (Results, lines 304–306; new Figures 7F–7H). In addition, the majority of PMC<sup>CRH+</sup> neurons project to the SPN but not the DGC (new Figures 7B–7D; Results, lines 297–301). We have assembled new Figure 7 using the newly acquired spinal images and the validated data.

      Reviewer #1 (Recommendations for the authors):

      From the abstract: "Anatomically, PMCESR1+ cells possess two subpopulations projecting to either the pelvic or pudendal nerve". I don't think these neurons directly project to either nerve.

      Thank you for this precise comment. We apologize for incorrectly stating that PMC<sup>ESR1+</sup> cells project directly to the pelvic or pudendal nerves. In the revised Abstract (lines 32–36) we have rephrased the sentence to clarify the actual anatomy: “Anatomically, PMC<sup>ESR1+</sup> neurons consist of three distinct spinal-projection-based subpopulations: one targeting the sacral parasympathetic nucleus (SPN), one innervating the dorsal gray commissure (DGC), and a third that projects to both regions, thereby enforcing the coordination of bladder contraction and sphincter relaxation in a rigid temporal sequence.” We trust this revision now accurately reflects the anatomical findings.

      Reviewer #2 (Public review):

      Summary:

      The authors have performed a rigorous study to assess the role of ESR1+ neurons in the PMC to control the coordination of bladder and sphincter muscles during urination. This is an important extension of previous work defining the role of these brainstem neurons, and convincingly adds to the understanding of their role as master regulators of urination. This is a thorough, well-done study that clarifies how the Pontine micturition center coordinates different muscle groups for efficient urination, but there are some questions and considerations that remain.

      Strengths:

      These data are thorough and convincing in showing that ESR1+PMC neurons exert coordinated control over both the bladder and sphincter activity, which is essential for efficient urination. The anatomical distinctions in pelvic versus pudendal control are clear, and it's an advance to understand how this coordination occurs. This work offers a clearer picture of how micturition is driven.

      We sincerely thank you for highlighting the rigor of our study and for recognizing the advance in understanding how PMC<sup>ESR1+</sup> neurons exert coordinated, anatomically segregated control over bladder and sphincter. We also appreciate the constructive suggestions that helped us further improve clarity, which we address point-by-point below.

      Weaknesses:

      The dynamics of how this population of ESR1+ neurons is engaged in natural urination events remains unclear. Not all ESR1+ neurons are always engaged, and it is not measured whether this is simply variation in population activity, or if more neurons are engaged during more intense starting bladder pressures, for instance. In particular, the response dynamics of single and doubly-projecting neurons are not defined. Additionally, the model for how these neurons coordinate with CRH+ neuron activity in the PMC is not addressed, although these cell types seem to be engaged at the same time. Lastly, it would be interesting to know how sensory input can likely modulate the activity of these neurons, but this is perhaps a future direction.

      Thank you for this insightful comment. First, we agree that not all ESR1+ neurons are consistently engaged during urination (Figure 1B). Because bladder pressure was not measured during the opto-tagging experiments, we cannot determine whether this reflects trial-to-trial variability in population activity or pressure-dependent recruitment of additional neurons. We speculate that stronger starting bladder pressures may recruit a larger subset of ESR1+ neurons, analogous to graded, pressure-dependent recruitment observed in peripheral sensory neurons (Bruns et al., J Neural Eng. 2011, PMID: 21878706; Marshall et al., Nature. 2020, PMID: 33057202).

      Second, using fiber photometry recording and optogenetic activation, we examined the dynamics of dual-projecting neurons in the PMC that were retrogradely labeled from the SPN and DGC. Their activity correlated with bladder contraction and sphincter relaxation, and optogenetic activation sequentially induced these events to trigger urination (see Recommendation #8). Although retrograde labeling captured only a subset of dual-projecting neurons, the results indicate that they coordinate bladder and sphincter activity.

      Third, previous studies suggest that PMC<sup>CRH+</sup> cells are associated with bladder contraction and likely serve as an integration center for context-dependent micturition behavior (Hou et al., Cell. 2016, PMID: 27662084; Ito et al., Elife. 2020, PMID: 32347794). We therefore propose that PMC<sup>CRH+</sup> cells establish the baseline conditions and contextual readiness for voiding, whereas PMC<sup>ESR1+</sup> cells act as the executive command to reliably initiate and execute the event.

      Finally, we agree that sensory inputs likely modulate PMC<sup>ESR1+</sup> neuron activity. Although this falls beyond the scope of the present study, it represents an important avenue for future investigation.

      Reviewer #2 (Recommendations for the authors):

      (1) In the introduction, the authors write that Keller 2018 only showed this ESR1 population to induce EUS relaxation, but those results also do show bladder contraction with photostimulation of this population. While the authors' work extends this finding in important ways, this should be acknowledged (line 60).

      Thank you for this important correction. We have now revised the Introduction to explicitly acknowledge that stimulation of neurons expressing estrogen receptor 1 (ESR1) in the PMC (PMC<sup>ESR1+</sup>) contributes to sphincter relaxation and increased bladder pressure (Introduction, lines 60-62), as originally reported by Keller et al., 2018.

      (2) I think a more detailed analysis of the dynamics of neural responses in the PMC ESR1 neurons would be valuable. For example: are the same cells always engaged before micturition, or do different populations activate on different trials? Can the authors comment on the half of the opto-tagged ESR1 population that is not firing during urination? Do they ever fire? A cell-by-cell analysis of which neurons are engaged over multiple trials would be very valuable to understand the dynamics of population activity. Figure 1H shows cumulative sessions, but what do single sessions look like?

      Thank you for these valuable comments. In response, we have performed refined single-trial analyses of neuronal activity, as detailed in the point-by-point replies below.

      For example: are the same cells always engaged before micturition, or do different populations activate on different trials?

      Among 11 PMC<sup>ESR1+</sup> units that showed urination-related excitation, 8 units exhibited a consistent firing increase in every voiding trial, whereas the remaining 3 increased their discharge in >78 % of trials (Figure 1B; new Figure supplement 3F). Thus, the same PMC<sup>ESR1+</sup> cells are recruited repeatedly, rather than distinct populations being activated on different trials. We have added this clarification to Results (lines 106–108).

      Can the authors comment on the half of the opto-tagged ESR1 population that is not firing during urination? Do they ever fire? A cell-by-cell analysis of which neurons are engaged over multiple trials would be very valuable to understand the dynamics of population activity.

      Approximately half of the opto-tagged PMC<sup>ESR1+</sup> cells showed no increase in firing rate during urination, yet exhibited spontaneous spikes at other times (new Figure supplement 3G), confirming their electrical competence. Because the PMC also participates in defecation, uterine activity, and other pelvic functions (Rouzade-Dominguez et al., Eur J Neurosci. 2003, PMID: 14686905; Schellino et al., Frontiers in Neuroanatomy. 2020, PMID: 33013330; Quaghebeur et al., Auton Neurosci. 2021, PMID: 34391125), these ESR1+ neurons may serve functions other than urination. We have now added this cell-by-cell analysis and discussion to the manuscript (Results, lines 108-112).

      Figure 1H shows cumulative sessions, but what do single sessions look like?

      As shown in new Figure supplements 3F–3G, single-session raster plots reveal that PMC<sup>ESR1+</sup> neurons display consistent firing patterns across individual trials. Neurons whose firing rate increased during urination did so in most trials (Figure supplement 3F), whereas neurons unrelated to voiding remained silent or showed no discernible rate change during voiding across trials (Figure supplement 3G). These single-session observations are consistent with the cumulative population analysis shown in Figure 1H (new Figure 1B).

      (3) Supplemental Figure 4: It seems clear from this figure that NVCs are only occurring when the sphincter fails to engage. Can the authors quantify how often this is the case?

      Thank you for this important point. We have now quantified the occurrence of non-voiding contractions (NVCs) across all 229 bladder contraction events from 3 mice shown in Supplemental Figure 4. NVCs were observed exclusively when the external urethral sphincter failed to relax, accounting for 62/229 events (27.1 %), whereas coordinated voiding contractions (VCs) occurred in the remaining 167 events (72.9 %). These new data are presented in Figure supplement 4C.
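
      As a quick arithmetic check of the proportions quoted above, the following minimal Python sketch recomputes the percentages from the reported counts (229 bladder contraction events, 62 of them NVCs). The variable and function names are illustrative and do not come from the manuscript.

```python
# Counts taken from the response: 229 bladder contraction events, 62 NVCs.
total_events = 229
nvc_events = 62
vc_events = total_events - nvc_events  # remaining events are voiding contractions


def percent(count, total):
    """Share of `count` in `total`, as a percentage rounded to one decimal."""
    return round(100 * count / total, 1)


nvc_pct = percent(nvc_events, total_events)
vc_pct = percent(vc_events, total_events)
print(f"NVC: {nvc_events}/{total_events} = {nvc_pct} %")
print(f"VC:  {vc_events}/{total_events} = {vc_pct} %")
```

      The recomputed values agree with the 27.1 % and 72.9 % stated in the response.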

      (4) Continuing from the above point: the authors say that the insufficient top-down drive or strength of activity from PMC ESR1 neurons is why NVCs occur. In looking closely, it also seems there is a small hump and subsequent increase in the calcium signal when the EUS bursting begins (particularly clear in Supplementary Figure 4). Could this instead mean that the bursting/urethral activity itself is feeding back onto the PMC to continue/enhance its activity, and it is instead the lack of sphincter bursting that results in the NVC? Could the authors analyze the signal during and after bursting starts? This model is consistent with one of the classic reflexes defined by Barrington, in which urethral fluid flow/activation enhances bladder contraction. The Figure 4 transection experiments do not fully answer this, as the authors are driving activity in the PMC at this time, but they could test this using PDN transection with fiber photometry recording.

      Thank you for this important point. We fully agree that EUS bursting may provide excitatory feedback to the PMC that sustains or even amplifies its activity, and that the absence of such feedback could underlie NVCs. To test this possibility, we re-analyzed the fiber-photometry traces aligned to the onset and offset of each EUS bursting episode (new Figure supplement 4). A small but consistent hump in the Ca<sup>2+</sup> signal appeared before bursting onset, and the Ca<sup>2+</sup> signal continued to rise throughout the bursting episode (Figure supplement 4B, yellow arrow). The amplitude at bursting offset was significantly higher than both the NVC peak and the level recorded at bursting onset. These observations support the interpretation that urethral fluid flow/activation supplies excitatory feedback that reinforces PMC activity and bladder contraction, consistent with Barrington’s classic reflex. We have incorporated these new analyses into the revised manuscript (lines 145–155 and Figure supplement 4F).

      We agree that the positive-feedback loop described by Barrington’s classic urethra-to-bladder reflex is an intriguing mechanism. However, the PDN-transection experiment in Figure 4 was designed to determine if bladder contractions triggered by PMC<sup>ESR1+</sup> cells can proceed in the absence of sphincter bursting, not to evaluate this reflex. Incorporating simultaneous fiber-photometry recording into the PDN-transection experiment would therefore go beyond the scope of the present study. In future work we are keen to combine PDN transection with fiber photometry to further determine whether the urethra-to-bladder reflex contributes to the sustained PMC activity observed in our paradigm.

      (5) In Figure 4, is the timing of sphincter engagement different with ChR2 stimulation from what normally occurs? It appears that the bursting happens immediately upon activation whereas bladder contraction is a bit delayed.

      Thank you for this important observation. We have carefully re-examined the EMG traces from all animals shown in Figure 4. We confirm that the onset of sphincter bursting activity during ChR2 stimulation is indeed more rapid than during natural reflex voiding; nevertheless, the onset of phasic sphincter bursting during ChR2 stimulation remained delayed relative to the intravesical pressure rise (see Figure 8B).

      The immediate sphincter discharge visible in some trials was tonic EUS discharge or rare irregular bursting, not the typical EUS bursting. This tonic pattern corresponds to the spinal guarding reflex that suppresses urine leakage (Fowler et al., Nature Reviews Neuroscience. 2008, PMID: 18490916; Keller et al., Nature Neuroscience. 2018, PMID: 30104734). These segments were identified by their amplitude and spectral content and excluded from burst-onset analysis. Our analysis protocol therefore distinguishes tonic guarding activity from true phasic bursting, ensuring that only the latter was used to determine burst timing.

      (6) The explanation on line 299 about how spinal reflexes are impinging on this circuit is confusing. I agree that the bladder contraction stopping later than the EUS signal likely has something to do with spinal reflexes, but it seems this could instead be feedback from the urethral fluid flow, which continues bladder contractions (urethra-detrusor facilitative reflex). Could the authors clarify their thoughts here?

      Thank you for highlighting this ambiguity. We agree that the delayed cessation of bladder contraction could equally reflect either (1) the urethra-to-bladder facilitative reflex driven by ongoing urethral fluid flow or (2) spinal reflexes that we described. In the revised manuscript (Results, lines 343–349), we have re-worded the paragraph to make this dual possibility explicit, thereby avoiding an overly strong emphasis on spinal mechanisms alone.

      (7) A note on phrasing: the authors frequently say PMCESR1 cells drive sphincter relaxation, but then show an effect on sphincter bursting. Experienced readers might realize that relaxation and bursting are connected, but this might be confusing for readers and should be clarified in the text.

      Thank you for highlighting the potential ambiguity. We agree that the sentence “PMC<sup>ESR1+</sup> cells drive sphincter relaxation” can seem paradoxical when our data show increased EUS bursting. In adult mice, the EUS does not remain continuously relaxed during voiding; instead, it generates rhythmic bursting composed of high-frequency spike clusters (active periods) alternating with low tonic activity (silent periods), resulting in rhythmic contractions and relaxations of the EUS. This phasic activity acts as a pump that facilitates urine flow through the narrow rodent urethra (Kadekawa et al., Am J Physiol Regul Integr Comp Physiol, 2016, PMID: 26818058). The EUS bursting activity we recorded is consistent with the results reported in previous studies (Keller et al., Nat Neurosci, 2018, PMID: 30104734; Ito et al., Elife, 2020, PMID: 32347794).

      Consequently, when PMC<sup>ESR1+</sup> neurons initiate bursting, they simultaneously generate the relaxation phases that separate the spikes. To make this explicit we have replaced the phrase “PMC<sup>ESR1+</sup> cells drive sphincter relaxation” with “PMC<sup>ESR1+</sup> neurons trigger EUS bursting, which generates rhythmic sphincter contractions and relaxations.” (Results, page 7, lines 219-221). We have applied similar clarifications throughout the revised manuscript (Results, lines 125-129). We hope this revision eliminates any apparent contradiction.

      (8) The question remains as to which neurons (dual projecting, single projecting, or all?) are active in natural urination. This is possible to do through dual injection of retrograde virus in SPN and DGC that could coordinately turn on Gcamp, but this challenging experiment is perhaps beyond the scope of this paper. Even still, the authors could discuss their model for whether the dual- and single-projecting neurons are all engaged at once in a natural urination event. Do the authors have any data that could provide insight as to when these sub-populations are active? Results from the opto-tagging in Figure 1 (and comment #2 about single neuron firing properties) might provide a foundation for hypotheses or insights.

      Thank you for this valuable suggestion. We have now performed the experiment you proposed: dual injection of retrograde viruses (AAV-Retro-Cre and AAV-Retro-DIO-GCaMP6s) into the SPN and DGC was used to selectively label PMC dual-projecting neurons, and a 200-µm optic fiber was implanted above the PMC to record their Ca<sup>2+</sup> dynamics during natural urination (Figure supplement 11A and Methods, lines 470–474, 652-655). Dual-projecting neurons exhibited robust activation throughout the entire voiding phase that was tightly correlated with intravesical pressure rise and EUS bursting (Figure supplements 11A–11H). However, technical limits of current retrograde tools preclude selective isolation of single-projecting (SPN-only or DGC-only) subsets for independent fiber-photometry recordings, because injection restricted to one target unavoidably labels both single- and dual-projecting cells. We now state this technical limitation explicitly (Discussion, lines 426-430).

      Accordingly, in the revised Discussion (lines 389-406), we integrate fiber-photometry Ca<sup>2+</sup> signals with single-unit data from opto-tagged recordings to propose several testable, non-mutually-exclusive models for how dual- and single-projecting PMC<sup>ESR1+</sup> neurons are engaged during natural urination: “Based on population dynamics obtained by fiber photometry (Figures 1D-1H, Figure supplements 1A-1F, and Figure supplements 11A-11H) and single-neuron firing properties recorded via optrode (Figures 1A-1C), we propose several mechanistic models for the engagement of dual- and single-projecting PMC<sup>ESR1+</sup> neurons during natural micturition. One possibility is that all three populations (dual-projecting, SPN-projecting and DGC-projecting neurons) are co-activated, with the dual-projecting subset acting as a “bridging amplifier” that sustains rising bladder pressure while coordinating EUS relaxation. Alternatively, SPN-projecting neurons may be recruited first to initiate bladder contraction, followed by DGC-projecting neurons that evoke EUS bursting and facilitate urine entry into the urethra; once flow begins, the urethro-detrusor facilitative reflex could recruit dual-projecting neurons to further enhance voiding efficiency. In addition, contextual or state-dependent urination—such as scent-marking behavior characterized by multiple voiding events with smaller volumes than reflexive urination—may predominantly rely on sequential and cooperative activation of single-projecting neurons. Other recruitment sequences remain conceivable. Future studies combining diverse urination-related behavioral paradigms with simultaneous recordings from projection-specifically labeled PMC neurons will be required to validate and refine these models.”

      Reviewer #3 (Public review):

      Summary:

      The paper by Li et al. explored the role of estrogen receptor 1 (Esr1)-expressing neurons in the pontine micturition center (PMC), a brainstem region also known as Barrington's nucleus (Hou et al 2016, Keller et al 2018). First, the authors conducted bulk Ca2+ imaging/unit recording from PMCESR1 neurons to investigate how PMCESR1 neural activity correlates with voiding behavior in conscious mice and with bladder pressure/external urethral muscle activity in urethane-anesthetized mice. Next, the authors conducted optogenetic inactivation/activation of PMCESR1 neurons to confirm their contribution to voiding behavior, and also combined peripheral nerve transection with optogenetic activation to confirm the independent control of bladder pressure and urethral sphincter muscle.

      We sincerely thank you for providing a thoughtful summary and insightful comments on our study.

      Weaknesses:

      (1) The study demonstrates that pelvic nerve transection reduces urinary volume triggered by PMC ESR1+ cell photoactivation in freely moving mice. Could the role of pudendal nerve transection also be examined in awake mice to provide a more comprehensive understanding of neural involvement?

      Thank you for this valuable suggestion. We conducted an additional experiment to determine the contribution of the pudendal nerve to PMC<sup>ESR1+</sup> neuron-driven voiding in awake mice. Bilateral pudendal nerve transection (PDNx) reduced the optogenetically evoked urine volume compared with sham-operated controls, yet photoactivation of PMC<sup>ESR1+</sup> neurons still reliably induced urination after PDNx (new Figure 6). Thus, bilateral integrity of the pudendal nerve is required for efficient PMC<sup>ESR1+</sup> neuron-driven voiding, most likely by transmitting the signals that entrain rhythmic EUS bursting. These data and experimental details have been incorporated into Figure 6, Results (lines 272–276), and Methods (lines 542–545).

      (2) While the paper primarily focuses on PMCESR1+ cells in bladder-sphincter coordination, the analysis of PMCESR1+-DGC/SPN neural circuits - given their distinct anatomical projections in the sacral spinal cord - feels underexplored. How do these circuits influence bladder and sphincter function when activated or inhibited? Also, do you have any tracing data to confirm whether bladder-sphincter innervation comes from distinct spinal nuclei?

      Thank you for this critical comment. To determine how PMC<sup>ESR1+</sup> neurons that target distinct sacral nuclei influence bladder–sphincter coordination, we first focused on the dual-projecting subset in a new experiment (Figure supplement 11 and Methods, lines 470–477, 652-655, 669-673). Dual retrograde virus injections into SPN and DGC selectively labelled PMC dual-projecting neurons, a subset of which are ESR1+. Fiber-photometry recordings showed that these cells were active during bladder contraction and sphincter relaxation (Figure supplements 11E-11H), whereas optogenetic activation reliably initiated urination: bladder pressure rose immediately and was followed by rhythmic EUS bursting (Figure supplements 11I-11N and 12B; Results, lines 309-313, 332-335). Thus, the dual-projecting sub-population is sufficient to coordinate bladder contraction with sphincter relaxation. Current retrograde tools do not allow selective isolation of single-projecting (SPN-only or DGC-only) subsets; injecting only one target unavoidably labels both single- and dual-projecting cells. Consequently, we cannot yet compare the functional impact of pure SPN-only versus DGC-only PMC populations. This limitation is now stated explicitly in the revised Discussion (lines 426–430).

      In our 2025 paper (Yan et al., Commun Biol, 2025, PMID: 40259086), we used PRV-based retrograde tracing to show that SPN and DGC constitute two separate spinal nuclei controlling the bladder and the EUS, respectively. Classic studies have reached the same conclusion (Yao et al., Nat Neurosci, 2018, PMID: 30361547; Karnup & De Groat, IBRO Reports, 2020, PMID: 32775758; Karnup, Auton Neurosci, 2021, PMID: 34391124). These citations and a concise summary have been added to the Results (lines 289–294).

      (3) Although the paper successfully identifies the physiological role of PMCESR1+ cells in bladder-sphincter coordination, the study falls short in examining the electrophysiological properties of PMC ESR1+-DGC/SPN cells. A deeper investigation here would strengthen the findings.

      Thank you for this thoughtful suggestion. While a detailed electrophysiological characterization of PMC<sup>ESR1+-DGC/SPN</sup> neurons would provide complementary information, the primary goal of the present study was to define the in vivo functional dynamics and behavioral role of these neurons during natural urination. As you suggested, further electrophysiological analysis of PMC<sup>ESR1+-DGC/SPN</sup> neurons will be an important direction for our future work.

      (4) The parameters for photoactivation (blue light pulses delivered at 25 Hz for 15 ms, every 30 s) and photoinhibition (pulses at 50 Hz for 20 ms) vary. What drove the selection of these specific parameters? Moreover, for photoactivation experiments, the change in pressure (ΔP = P5 sec - P0 sec) is calculated differently from photoinhibition (Δpressure = Ppeak - Pmin). Can you clarify the reasoning behind these differing approaches?

      Thank you for this opportunity to clarify our experimental design. The photoactivation protocol (25 Hz, 15 ms pulses) was chosen because PMC<sup>ESR1+</sup> neurons faithfully follow this frequency without depolarisation block and it reliably triggers voiding (Keller et al., Nat Neurosci, 2018, PMID: 30104734). For photoinhibition we originally stated “50 Hz, 20 ms pulses”, but this was an error. Consistent with the same study (Keller et al., Nat Neurosci, 2018, PMID: 30104734), we used continuous light (constant illumination) to maintain sustained suppression. The Methods section has been corrected (lines 659-661, 690-691).

      The ΔP formula was tailored to the temporal profile of each manipulation. For activation, ΔP (P<sub>5 sec</sub> - P<sub>0 sec</sub>) captures the rapid pressure rise after light onset; the same window was used by Hou et al. (Cell. 2016, PMID: 27662084). For inhibition, because saline infusion produces rhythmic reflex voiding, we delivered light at the onset of EUS bursting (i.e., when pressure was already near its peak). Inhibition abruptly stops the bladder contraction, so the bladder cannot return to its pre-void baseline. The Δpressure (P<sub>peak</sub> – P<sub>min</sub>) was therefore used to quantify the extent to which the ongoing pressure wave was aborted by photoinhibition. P<sub>min</sub> is the lowest value reached before the next infusion-driven upswing, making the metric insensitive to the slow baseline drift produced by continuous infusion. These clarifications have been added to the Methods (Methods, lines 676-677, 679-680, 692-693).
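
      The two pressure metrics described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' analysis code: the function names, the sample-lookup rule (first sample at or after the requested time), and the caller-supplied segment bounds are all hypothetical, and the traces are synthetic.

```python
from bisect import bisect_left


def pressure_at(times, pressures, t):
    """Return the pressure of the first sample at or after time t (seconds)."""
    i = bisect_left(times, t)
    if i >= len(times):
        raise ValueError("t lies beyond the end of the trace")
    return pressures[i]


def delta_p_activation(times, pressures, light_on, window=5.0):
    """Photoactivation metric: P(5 s after light onset) - P(light onset)."""
    return (pressure_at(times, pressures, light_on + window)
            - pressure_at(times, pressures, light_on))


def delta_p_inhibition(times, pressures, seg_start, seg_end):
    """Photoinhibition metric: P_peak - P_min within [seg_start, seg_end).

    seg_end is assumed to mark the start of the next infusion-driven
    upswing, so P_min is the lowest pressure reached before that upswing.
    """
    segment = [p for t, p in zip(times, pressures) if seg_start <= t < seg_end]
    if not segment:
        raise ValueError("no samples inside the requested segment")
    return max(segment) - min(segment)


# Synthetic traces sampled at 1 Hz (hypothetical values, arbitrary units).
times = [float(t) for t in range(11)]
activation_trace = [5, 5, 5, 9, 14, 19, 23, 25, 26, 26, 25]
inhibition_trace = [6, 7, 12, 30, 22, 15, 10, 8, 9, 14, 20]

dp_act = delta_p_activation(times, activation_trace, light_on=2.0)
dp_inh = delta_p_inhibition(times, inhibition_trace, seg_start=3.0, seg_end=9.0)
print(dp_act, dp_inh)
```

      Because the inhibition metric is a peak-to-trough difference within a single pressure wave, a slow additive baseline shift changes both terms equally and cancels out, which is the property the response attributes to it.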

      (5) The discussion could further emphasize how PMCESR1+ cells coordinate bladder contraction and sphincter relaxation to control urination, highlighting their central role in the initiation and suspension of this process.

      Thank you for this valuable comment. We have revised the Discussion to emphasize that PMC<sup>ESR1+</sup> neurons coordinate urination by sequentially driving bladder contraction followed by sphincter relaxation through their dual projections to the SPN and DGC. We also emphasized that this coordination is essential for the initiation and effective execution of voiding (Discussion, lines 369-388). In addition, in the revised Discussion (Discussion, lines 389-406), we integrate fiber-photometry Ca<sup>2+</sup> signals with single-unit data from opto-tagged recordings to propose several testable, non-mutually-exclusive models for how PMC<sup>ESR1+</sup> cells are engaged during natural urination.

      (6) In Figure 8, the authors analyze the temporal sequence of bladder pressure and EUS bursting during natural voiding and PMC activation-induced voiding. It would be acceptable to consider the existence of a lower spinal reflex circuit; however, the interpretation of the data contains speculation. It is hard to say that bladder pressure measurement reflects efferent pelvic nerve activity in real time. (As a biological system, bladder contraction is mediated by smooth muscle and does not reflect real-time efferent pelvic nerve activity. As an experimental set-up, bladder pressure measurement lags the true bladder pressure because of the tubing, whereas EUS bursting is recorded without delay.) These factors would affect the interpretation of the data, especially for the inactivation experiment. This reviewer recommends a rewrite of the section considering these limitations. Most of the section is suitable for the results.

      We agree with the reviewer that bladder pressure, mediated by smooth muscle contraction, provides an indirect measure of efferent pelvic nerve activity and is subject to both physiological and experimental delays. Regarding potential delay from the tubing system, pressure propagates in fluid at approximately 1000 m/s (Kela & Pekka, Proceedings of World Academy of Science Engineering & Technology, 2009, DOI: 10.5281/zenodo.1080526). Given that the total tubing length in our setup is 0.5-1 m, the estimated transmission delay is only 0.5-1 ms, which is negligible compared with the observed time difference (~700 ms) between the cessation of EUS bursting and the termination of bladder contraction. Pressure transmission through the tubing is therefore not expected to introduce a meaningful temporal delay. However, we cannot exclude the possibility that the pressure measurement itself lags the underlying neural drive, because bladder pressure does not necessarily reflect efferent pelvic nerve activity in real time. Future studies using simultaneous recordings of bladder pressure and pelvic nerve discharges will help clarify whether a true temporal delay exists. We also agree that additional physiological or peripheral factors may contribute to this difference in timing. As suggested by the reviewer, we have revised the discussion to consider the potential influence of other factors, such as the urethra-detrusor facilitative reflex (Results, lines 343-349).
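
      The tubing-delay estimate above is a simple length-over-speed calculation. The sketch below reproduces it with the values quoted in the response (tubing length of 0.5-1 m, propagation speed of about 1000 m/s, observed lag of about 700 ms). The function name is illustrative, and the calculation assumes that the quoted propagation speed applies to the tubing actually used.

```python
def transit_delay_ms(tube_length_m, wave_speed_m_per_s=1000.0):
    """Time for a pressure wave to traverse the tubing, in milliseconds."""
    return tube_length_m * 1000.0 / wave_speed_m_per_s


short_delay = transit_delay_ms(0.5)  # shortest tubing quoted in the response
long_delay = transit_delay_ms(1.0)   # longest tubing quoted in the response

observed_lag_ms = 700.0  # approximate lag between EUS burst end and contraction end
worst_case_share = round(100.0 * long_delay / observed_lag_ms, 2)

print(f"transit delay: {short_delay}-{long_delay} ms")
print(f"worst case is {worst_case_share} % of the observed lag")
```

      Even for the longest tubing, the transit time is well under 1 % of the observed lag, so the tubing alone cannot account for the timing difference under these assumptions.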

      Reviewer #3 (Recommendations for the authors):

      (1) In opto-tag experiments, a comparison of average AP waveform during behavior and during light stimulation should be included as criteria. It should be mostly the same waveform.

      Thank you for bringing this to our attention. We have now added this comparison as an inclusion criterion in the revised manuscript. Figure supplement 3B shows representative examples of the average waveforms, and Figure supplement 3C displays the distribution of correlation coefficients between spontaneous and light-evoked spikes for all recorded PMC<sup>ESR1+</sup> units, all of which exhibited r > 0.8.
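
      A waveform-similarity check of the kind described above can be sketched as a Pearson correlation between the average spontaneous and light-evoked waveforms. This is an illustration only: the function names and the synthetic waveforms are hypothetical, and treating the reported value of r > 0.8 as the inclusion threshold is an assumption rather than a documented rule.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("waveforms must have equal length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for a constant waveform")
    return cov / (norm_x * norm_y)


def passes_waveform_criterion(spontaneous, evoked, threshold=0.8):
    """True if the two average waveforms correlate above the threshold."""
    return pearson_r(spontaneous, evoked) > threshold


# Synthetic average waveforms (hypothetical values, arbitrary units).
spontaneous = [0.0, -1.0, -4.0, -2.0, 1.0, 2.0, 1.0, 0.0]
evoked_same_unit = [0.9 * v + 0.1 for v in spontaneous]  # scaled and offset copy
evoked_other_unit = [-v for v in spontaneous]            # inverted shape

print(passes_waveform_criterion(spontaneous, evoked_same_unit))
print(passes_waveform_criterion(spontaneous, evoked_other_unit))
```

      Pearson correlation ignores amplitude scaling and constant offsets, so it compares waveform shape only; amplitude differences between spontaneous and evoked spikes would need a separate check.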

      (2) Optical fiber implantation seems to be done in two different methods. In Figure 1 and Figure 2, the fiber tip is positioned just above PMC but in Figure 3 it seems to be angled. The information should be included in the Methods section.

      Thank you for this important comment. We have now clarified in the Methods that for Figures 1 and 2, the optical fibers were implanted vertically above the PMC, whereas for Figure 3, the left optical fiber was implanted at a 33° lateral angle targeting the PMC (Methods, lines 499-503).

      (3) In the closed-loop inhibition experiments of Figure 2, the parameters to start closed-loop photo-inactivation were not described in the method. If it is a manual closed loop, it should be described clearly.

      Thank you for raising this important point. We apologize for omitting these details in the original Methods. We have now added a complete description of the manual closed-loop photo-inhibition protocol, including the triggering criteria and operator-controlled timing, in the revised Methods section (lines 602–605).

      (4) In Figure 7A/E the authors provide a spinal cord image to show the injection site, but the image is misleading. The figure only shows AAV-infected CRH/ESR1 neurons in the spinal cord section. It does not indicate the AAV injection site or the terminal distribution.

      Thank you for your important comment. We apologize for providing a spinal cord image that did not accurately depict the injection site. To rigorously verify that our spinal injections were confined to SPN or DGC, we performed new retrograde-tracing experiments in ESR1-Cre and CRH-Cre mice. A mixture of AAV-Retro-DIO-mCherry or AAV-Retro-DIO-EGFP with the retrograde tracer CTB-647 was injected specifically into SPN or DGC. Only animals in which CTB-647 fluorescence was strictly limited to the target nucleus, without spread to the adjacent region, were included (new Figures 7A and 7E). These data confirmed our original observations and have been pooled in Figure 7. The manuscript and figure have been updated accordingly (Results, lines 297-301, 304-306; Methods, lines 465–466).

    1. eLife Assessment

      This manuscript applies a theoretical analysis to two published datasets on yeast and bacterial evolution to compare different ways of quantifying fitness. It makes an important advance by clarifying how discrepancies can arise by using different approaches and provides recommendations for best practices. Overall, this is an impressive and highly beneficial study that is based on convincing evidence and has the potential of setting standards in this rapidly growing field.

    2. Reviewer #1 (Public review):

      The authors point out that the fitness estimates obtained from different experimental assays (monoculture, pairwise competition or bulk competition) are not generally equivalent, not even with regard to the fitness ranking of different genotypes. Using a computational model based on experimentally measured growth phenotypes for knockout strains in yeast, as well as data from Lenski's Long Term Evolution Experiment (LTEE), they derive a set of best practice rules aimed at extracting the optimal amount of information from such experiments.

      The study is very complete on a technical level, and the conceptual weaknesses raised in the first round of reviews have been fully addressed in the revision.

    3. Reviewer #2 (Public review):

      Summary:

      The manuscript "Quantifying microbial fitness in high-throughput experiments" provides a comprehensive analysis of the various approaches to quantifying fitness in microbial evolution, focusing on three primary factors: encoding of relative abundance, time scale of measurement, and the choice of reference subpopulation. The authors systematically explore how these choices impact fitness statistics and provide recommendations aimed at standardizing practices in the field. This manuscript aims to highlight the impact of differing fitness definitions and the methodologies utilized for analysis and how that can significantly alter interpretations of mutant fitness, affecting evolutionary predictions and the overall understanding of genetic interactions in the experiments.

      Strengths:

      The choices for quantifying fitness in evolution experiments are critical and highly relevant given the increasing prevalence of high-throughput experiments in evolutionary biology. The authors methodically categorize fitness statistics and their implications, providing clarity on a complex subject. This structured approach aids in understanding the nuances of fitness measurement. The manuscript effectively highlights how different choices in fitness measurement can influence fitness rankings and the understanding of epistasis, which is important for modeling evolutionary dynamics.

      Comments on revisions:

      The authors have comprehensively addressed all previous comments and suggestions. In particular, the addition of the new methods section: 'A guide to calculate pairwise relative fitness under the logit encoding from bulk competition data' - significantly improves the clarity of the implementation and helps in the overall interpretation of the framework.

    4. Reviewer #3 (Public review):

      Summary:

      The authors present analyses of different fitness measures derived from empirical data from yeast knock-out mutants and the long-term evolution experiment (LTEE) with Escherichia coli to explore discrepancies and identify preferred methods to estimate relative fitness in high-throughput experiments. Their work has three components. They first discuss the different "encodings" of relative abundance data and conclude that logit-transformations are preferred, because they transform nonlinear abundance trajectories into linear trajectories with greater predictive power. Next, they compare per-generation with per-growth cycle relative fitness estimates inferred from simulations of pairwise competitions based on published growth traits for the yeast strains and on published pairwise competition measurements for the LTEE data. Both data sets show quantitative and qualitative (i.e. rank order) discrepancies of estimates across different time scales, which are highlighted by considering possible underlying causes (i.e. trade-offs between growth traits) and consequences (i.e. epistasis among mutations affecting different growth traits). Finally, the authors compare simulated pairwise and bulk (i.e. where many mutants compete during a growth cycle in a single environment) competition assays based on the yeast knock-out mutants and demonstrate an optimal ratio of collective mutants to wild-type strains that minimizes both sampling error and overestimation of fitness estimates when compared with pairwise competitions.

      Strengths:

      The study deals with a highly relevant topic. Fitness is central to general evolutionary theory, but also poorly defined and implies different traits for different organisms and conditions. For microbes, which are often used in evolution experiments, high-throughput experiments may yield different measures to quantify abundance over time, from individual growth traits to bulk competition experiments. Hence, it is relevant to consider discrepancies among those measures and identify preferred measures with respect to predicting population dynamic and evolutionary processes. The present study contributes to this aim by (i) making readers aware of differences among commonly used fitness estimates, (ii) showing that simulated (yeast) and calculated (E. coli) competitive fitness may differ across time scales, and (iii) showing that bulk competitions may yield relative fitness estimates that are systematically higher than pairwise competitions. The study is rather thorough on the theory side, with extensive derivations and analyses of various fitness measures using their resource competition model in the Supplementary Information. The study ends with a few practical recommendations for preferred methods to infer relative fitness estimates, that may be useful for experimentalists and stimulate further investigations.

      Weaknesses:

      The study has a few limitations. Perhaps the most apparent limitation is the lack of a clear answer to the question which fitness measure is best "in the light of first principles". The authors show clear discrepancies between fitness estimates across different time scales or using different reference genotypes in bulk competition and provide useful recommendations based on practical considerations (e.g. using pairwise competitions as "golden standard"), but it remains unclear whether these measures provide the greatest value for the questions researchers may want to answer with them (e.g. predict shifts in genotype frequencies). -- The authors have convinced me in their response that their recommendations were fundamentally related to the resource competition model, and the changes in introduction and discussion help to appreciate the choice of fitness measure in relation to the research question.

      A second limitation is that the authors analyse fitness differences arising solely from resource competition, whereas microbes often interact via other mechanisms, e.g. the production of anticompetitor toxins, cross-feeding of metabolites or lack of growth to enhance their persistence in stress conditions. Without simulations of these processes, understanding discrepancies among fitness measures is necessarily limited. In addition, the analysis of trade-offs between growth traits causing these discrepancies during resource competition seems confounded by biases in measurement error or parameter estimation, at least for growth rate and lag time (Fig. 2B), where the replicate estimates for the wildtype show a similar negative correlation. -- The use of a resource competition model for fitness inference is generally well motivated now. I accept their argument that resource competitive differences are most important for microbial strains with small genetic differences (e.g. from mutant libraries or from the same evolution experiment). However, it is relevant to note that this ignores situations that are rather common, where the wild-type strain produces an anticompetitor toxin or causes growth inhibition through metabolite products that lower the pH (and derived strains will likely contain resistant mutations).

      Third, the study does not validate relative fitness predictions from growth traits (as is done for the yeast mutants) with measured relative fitness estimates using competition assays, while such data are available, e.g. for the LTEE. This would strengthen their inferences about preferred fitness measures. -- In their response, the authors explain that their aim was different, i.e. to provide "proof of principle" that the choices of fitness measure can produce discrepancies even when they follow the same growth model.

      Fourth, the analysis of epistasis between mutations affecting different growth traits (shown in Fig. 3) based on the LTEE data could be better introduced and analysed more comprehensively. Now, the examples given in panels C-F seem rather idiosyncratic and readers may wonder how general these consequences of using fitness estimates based on different time scales are. -- The authors have made extensive improvements to address how different growth parameters, especially lag and growth rate, differently affect apparent epistasis based on measures at different time scale (per generation vs per cycle). These provide a more comprehensive analysis of down-stream consequences for epistasis detection.

      Finally, the study is generally less accessible to experimentalists due to the extensive and principled treatment of specific population dynamic models and fitness inferences. This may distract from the overarching aim to identify fitness measures that are most accurate and useful for predictions of population dynamic and evolutionary processes. In this light, the motivation for the initial discussion of the importance of how to best encode relative abundance (Fig. 1) is unclear. Also, the conclusion, that logit encoding is preferred, because it linearizes logistic growth dynamics and "improves the quality of predictions", is not further motivated. Experimentalists using non-linear models to infer fitness from growth curves or competition assays may miss the relevance of this discussion. -- Thanks for this explanation (indeed, I confused "logistic dynamics" with "logistic growth model"); the additional explanations and text reductions have improved accessibility for experimentalists.
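      The distinction between "logistic dynamics" of relative abundance and a "logistic growth model" can be made concrete with a short derivation. The block below is a minimal sketch, assuming a two-strain competition in which the mutant's relative abundance x follows the logistic equation with a constant selection coefficient s; the symbol s and the continuous-time form are illustrative assumptions, not notation taken from the manuscript.

```latex
\frac{dx}{dt} = s\,x(1-x)
\quad\Longrightarrow\quad
\frac{d}{dt}\,\mathrm{logit}\,x
  = \frac{1}{x(1-x)}\,\frac{dx}{dt}
  = s,
\qquad
\mathrm{logit}\,x(t) = \mathrm{logit}\,x(0) + s\,t,
\qquad
\mathrm{logit}\,x \equiv \ln\frac{x}{1-x}.
```

      Under this assumption the sigmoidal trajectory x(t) becomes a straight line in logit space whose slope is s, which is the sense in which the logit encoding "linearizes" the dynamics.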

      Comments on revisions:

      I appreciate the thorough and effective response to all recommendations and have no further comments.

    5. Author response:

      The following is the authors’ response to the original reviews.

      We thank both editors and the three reviewers for their constructive criticism of our work. As a result of these comments, we have made several significant revisions to the paper that we believe strengthen and clarify our major results:

      (1) Following suggestions from Reviewers #1 and #3, we have improved our introduction to the different fitness concepts (lines 105–148) and streamlined the discussion of the logit encoding (lines 175–190). In particular, we have moved the most technical points to the SI (Sec. S3).

      (2) Based on criticisms of our usage of the population dynamics model from Reviewers #1 and #3, we significantly revised our explanation of the motivation and interpretation of this model (lines 284–310 and 323–336) and our discussion of the generalizability of these results (lines 678–728), including the possible effects of interactions besides resource competition.

      (3) Following a request from Reviewer #3, we have expanded our analysis of epistasis to systematically test all possible double mutants between qualitative types of trait perturbations in the model. We have added a new main text figure (Fig. 3), new SI figures (Figs. S9–S15), a new subsection in the Results (lines 344–395), and corresponding new sections in the Methods (lines 864–892) and SI (Sec. S8).

      (4) Following concerns from Reviewers #2 and #3 about the limited empirical data, we have expanded our analysis of the LTEE data (new main text Fig. 4, revised text on lines 416–439, and revised SI Figs. S16–S18) and have analyzed two new benchmarking datasets for bulk fitness to test our predictions (new main text Fig. 6, new Results subsection on lines 561–590, and new SI Figs. S24 and S25).

      (5) Following the criticism of Reviewer #3 about the lack of a clear recommendation on fitness quantification that provides the greatest value for a given scientific question, we have better explained what we think the scientific consequences of fitness are as a motivation for our analysis (lines 82–88, 319–322, and 615–630) and replaced the final flowchart figure with a step-by-step guide in the Methods to implement our recommendations in practice (lines 964–982).

      Reviewer #1 (Public review):

      The authors point out that the fitness estimates obtained from different experimental assays (monoculture, pairwise competition, or bulk competition) are not generally equivalent, not even with regard to the fitness ranking of different genotypes. Using a computational model based on experimentally measured growth phenotypes for knockout strains in yeast, as well as data from Lenski’s Long Term Evolution Experiment (LTEE), they derive a set of best practice rules aimed at extracting the optimal amount of information from such experiments.

      The study is very complete on a technical level and I have no suggestions for further analyses. However, I feel the readability and the conceptual focus of the manuscript could be significantly improved by rearranging the material with regard to the contents of the main text vs. the Methods and the Supplement. Detailed recommendations:

      (1) Regarding readability, the large number of references to material in the Methods and Supplement fragment the main text and make it difficult to follow.

      We understand the challenges these references pose to the flow of the main text; we have attempted to keep those references to a minimum, while ensuring that technical details of the work are fully documented and referenced for completeness.

      (2) Conceptually, it seems to me that the current presentation obscures the reasons why we should care about fitness in the first place. In the first paragraph of Results, the authors define fitness “as any number that is sufficient to predict the genotype’s relative abundance x(t) over a short-time horizon”. To me, this seems like an extremely narrow and not very interesting definition. Instead, I view fitness as an intrinsic property of a genotype that allows us to predict its performance under a range of conditions, including in particular conditions that are different from the experimental setup that was used to obtain the fitness estimates. The latter viewpoint is well expressed in Supplementary Section S1, where the authors discuss the notion of fitness potential. I would recommend to move at least part of this discussion to the main text.

      We appreciate the reviewer’s viewpoint and have moved that conceptual discussion from the SI to the beginning of the Results section to give readers a broader perspective on fitness (lines 105–148). We use “potential” in analogy with potential energy in physics and have clarified this on lines 126–135.

      What we call fitness potential, like the other notions of fitness we discuss in this paper (relative and absolute fitness), is still specific to an environmental condition. Fitness as a property intrinsic to a genotype and independent of any environment, as the reviewer mentions, is an interesting concept but beyond the scope of this paper, which is focused on analyzing fitness measurements that are inevitably environment-specific and we have clarified this on lines 142–148. While it is true that this definition of fitness is narrow, it is what can be empirically measured directly, and thus we believe it is crucial to understand how to best interpret that data.

      By comparison, the arguments in favor of the logit encoding that currently open the Results section are rather straightforward and could be shortened significantly.

      We agree and have condensed this section (lines 175–192).

      (3) Similarly, the modeling strategy used in this work is quite subtle and needs to be explained more fully in the main text. The authors use growth traits (lag time, growth rate, and yield) extracted from monoculture experiments on a yeast knockout collection and feed them into a specific mathematical model to simulate pairwise and bulk competition scenarios. Since a key claim of the work is that monoculture experiments are generally poor predictors of competitive fitness, the basis for this conclusion and the assumptions on which it is based need to be described clearly in the main text. In the current version of the manuscript, this information has been largely relegated to the Methods section.

      We agree that our motivation for the population dynamics model and growth curve data was not clearly explained. We have significantly revised this section of the Results in the main text (lines 284–310).

      In particular, we recognize the potential for misunderstanding this material: we do not intend the relative fitness values calculated from this model to be interpreted as predictions of the true relative fitness between yeast deletion strains. Rather, we use the population dynamics model for our proof of principle: that the most basic features of microbial population dynamics in laboratory experiments, as captured by this model (resource competition, lag phase, growth phase, saturation), are sufficient to create discrepancies between common fitness statistics used in these experiments (different encodings, time scales, choices of reference subpopulations). We have added a statement to highlight existing work on monoculture predictors for competition outcomes [32, 34, 36, 37] on lines 453–459.

      Reviewer #1 (Recommendations for the authors):

      In the discussion of the LTEE in Section S8, the authors write on page 8 that “we couldn’t fit the fitted values a,b in ref. 29 so we were unable to check it”. I don’t understand this sentence - is the claim that the fit in ref. 29 was incorrect?

      We have clarified this point in the SI (now Sec. S9). Our point was not that the fit in Wiser et al. 2013 is incorrect, but merely that we could not find the exact values of the fitted parameters they obtained documented in their paper, so we could not compare our own fitted parameters directly to theirs.

      Also, at the end of the section, the authors refer to theory work on the long-term fitness trend in the LTEE. Here, two early references arguing for a logarithmic increase in fitness could be mentioned as well:

      International Journal of Modern Physics B 12, 361-391 (1998) Evolution and Extinction Dynamics in Rugged Fitness Landscapes Paolo Sibani, Michael Brandt, and Preben Alstrøm

      J. Stat. Mech. (2008) P04014 Evolution in random fitness landscapes: the infinite sites model Su-Chan Park and Joachim Krug

      We thank the reviewer for providing these two references and have added them to the list of previous works on long-term fitness trends at the end of the section (now Sec. S9).

      Reviewer #2 (Public review):

      Summary:

      The manuscript “Quantifying microbial fitness in high-throughput experiments” provides a comprehensive analysis of the various approaches to quantifying fitness in microbial evolution, focusing on three primary factors: encoding of relative abundance, time scale of measurement, and the choice of reference subpopulation. The authors systematically explore how these choices impact fitness statistics and provide recommendations aimed at standardizing practices in the field. This manuscript aims to highlight the impact of differing fitness definitions and the methodologies utilized for analysis and how that can significantly alter interpretations of mutant fitness, affecting evolutionary predictions and the overall understanding of genetic interactions in the experiments. Although this manuscript focuses on a critical issue in the quantification of fitness in high-throughput experiments, it relies heavily on only one experimental dataset (Warringer et al 2003) and one organism, i.e., yeast (Saccharomyces cerevisiae) grown in a defined medium, so the environmental influence is not completely captured. While the theoretical framework is strong, more experimental examples with more organisms (i.e., more datasets) in their analysis and comparison would enhance the manuscript, especially its conclusion.

      We have expanded our analysis of competition data from the Long-Term Evolution Experiment in E. coli (lines 416–439), including adding a main text figure (Fig. 4) along with the three SI figures (Figs. S16–S18). We have also added two completely different data sets that directly test our predicted discrepancies in fitness estimates from bulk competition experiments. From this data we have added a new main text figure (Fig. 6), two new SI figures (Figs. S24 and S25), and a new section at the end of the Results (lines 563–590).

      We wish to clarify, though, that the aim of this study is to develop theory on fitness quantification choices and minimal examples to demonstrate the potential for discrepancies between these choices. While we appreciate the reviewer’s interest in understanding how discrepancies in fitness statistics vary across organisms and environments, that is an empirical question beyond the scope of this paper.

      Strengths:

      The choices for quantifying fitness in evolution experiments are critical and highly relevant given the increasing prevalence of high-throughput experiments in evolutionary biology. The authors methodically categorize fitness statistics and their implications, providing clarity on a complex subject. This structured approach aids in understanding the nuances of fitness measurement. The manuscript effectively highlights how different choices in fitness measurement can influence fitness rankings and the understanding of epistasis, which is important for modeling evolutionary dynamics.

      Weaknesses:

      The theoretical framework is robust, but the manuscript could benefit from more empirical examples to illustrate how different fitness quantification methods lead to varied conclusions in experiments.

      Please see our response to the previous comment on this point.

      The discussion on the choice of reference subpopulation could be expanded with the influence of the environment or the condition. Different types of reference groups might yield different implications for fitness calculations, and further elaboration would enhance this section.

      While we agree that studying how environmental conditions affect fitness is an important and interesting problem, it goes beyond the scope of this paper, which focuses on the basic theory of quantifying microbial fitness from high-throughput experiments. Applications of this theory to empirical questions about environmental variation would be best served by their own studies. We have added a statement clarifying this goal (lines 144–148).

      We are unsure how the choice of reference subpopulation is related to this issue. In our view, if the goal of a mutant fitness measurement is to predict how that mutant would behave when arising spontaneously and competing against its immediate ancestor, the gold-standard reference subpopulation must always be the mutant’s immediate ancestor, or another mutant that is known to be phenotypically equivalent to the ancestor (e.g., neutral mutants in the case of a large mutant library). Other choices of reference subpopulations would not provide directly meaningful information in this regard.

      The authors overgeneralize some findings; for instance, the implications of fitness measurement choices could vary significantly across different microbes or experimental conditions. A more detailed discussion would strengthen the conclusion.

      We certainly agree that the consequences of fitness quantification choices could vary significantly across organisms and environments; our goal for this paper is to demonstrate what discrepancies are possible in principle and in particular how they depend on basic features of microbial population dynamics (e.g., variation in yield). We have added two separate paragraphs in the Discussion section to address the generalizability of our results in the context of pairwise (lines 678–710) and bulk fitness measurements (lines 711–728).

      Overall, this manuscript is a significant contribution to the field of evolutionary biology, addressing a critical issue in the quantification of fitness but lacks more experimental support to make it a wider claim. By systematically exploring the factors that influence fitness measurements, the authors provide valuable insights that can guide future research - the framework is computationally thorough but needs a more detailed explanation of concepts instead of generalizing.

      We have improved our explanation of several of the important concepts. In particular, we have significantly revised our explanation of the population dynamics model (lines 284–310) to emphasize its role as a null model to demonstrate how fundamental aspects of microbial growth are sufficient to cause discrepancies between fitness statistics. We have also revised two paragraphs on the generalizability of our results in the Discussion section (lines 678–728).

      Further work is needed, particularly to incorporate empirical examples and expand certain discussions to include environmental variation and their impact, which would improve clarity and applicability.

      We have added a sentence at the beginning of the Results section to acknowledge the environmental dependence of fitness (lines 142–148). We believe further discussion of that issue is beyond the scope of this paper, as it would require a significant amount of additional data and/or environmental modeling.

      Reviewer #2 (Recommendations for the authors):

      In addition to the comments from the previous sections, other specific comments:

      (1) Figure 5 needs to be populated with additional parameter details. For example, include brief descriptions of each parameter involved in the encoding, time scale, and reference choices. This will help users understand the implications of each choice. Adding these details will make the flow diagram more comprehensive, aiding researchers in implementing these steps more clearly.

      Following this comment and another comment about this figure from Reviewer #3, we decided to replace this figure with a new Methods section with step-by-step instructions (lines 964–982).

      (2) Duplication in Line 620: “Nevertheless, the fact that we see the fact that we see...” This redundancy needs to be corrected.

      We thank the reviewer for pointing this out; we have rewritten this paragraph.

      (3) More experimental data comparisons and their assessment concerning various microbial systems and multiple environmental conditions are recommended to support the claim.

      Please see our responses to the related public comments.

      Reviewer #3 (Public review):

      Summary:

      The authors present analyses of different fitness measures derived from empirical data from yeast knockout mutants and the long-term evolution experiment (LTEE) with Escherichia coli to explore discrepancies and identify preferred methods to estimate relative fitness in high-throughput experiments. Their work has three components. They first discuss the different “encodings” of relative abundance data and conclude that logit transformations are preferred because they transform nonlinear abundance trajectories into linear trajectories with greater predictive power. Next, they compare per-generation with per-growth cycle relative fitness estimates inferred from simulations of pairwise competitions based on published growth traits for the yeast strains and on published pairwise competition measurements for the LTEE data. Both data sets show quantitative and qualitative (i.e. rank order) discrepancies of estimates across different time scales, which are highlighted by considering possible underlying causes (i.e. trade-offs between growth traits) and consequences (i.e. epistasis among mutations affecting different growth traits). Finally, the authors compare simulated pairwise and bulk (i.e. where many mutants compete during a growth cycle in a single environment) competition assays based on the yeast knock-out mutants and demonstrate an optimal ratio of collective mutants to wild-type strains that minimizes both sampling error and overestimation of fitness estimates when compared with pairwise competitions.

      Strengths:

      The study deals with a highly relevant topic. Fitness is central to general evolutionary theory, but also poorly defined and implies different traits for different organisms and conditions. For microbes, which are often used in evolution experiments, high-throughput experiments may yield different measures to quantify abundance over time, from individual growth traits to bulk competition experiments. Hence, it is relevant to consider discrepancies among those measures and identify preferred measures with respect to predicting population dynamics and evolutionary processes. The present study contributes to this aim by (i) making readers aware of differences among commonly used fitness estimates, (ii) showing that simulated (yeast) and calculated (E. coli) competitive fitness may differ across time scales, and (iii) showing that bulk competitions may yield relative fitness estimates that are systematically higher than pairwise competitions. The study is rather thorough on the theory side, with extensive derivations and analyses of various fitness measures using their resource competition model in the Supplementary Information. The study ends with a few practical recommendations for preferred methods to infer relative fitness estimates, that may be useful for experimentalists and stimulate further investigations.

      Weaknesses:

      The study has several limitations. Perhaps the most apparent limitation is the lack of a clear answer to the question of which fitness measure is best “in the light of first principles”. The authors show clear discrepancies between fitness estimates across different time scales or using different reference genotypes in bulk competition and provide useful recommendations based on practical considerations (e.g. using pairwise competitions as the “golden standard”), but it remains unclear whether these measures provide the greatest value for the questions researchers may want to answer with them (e.g. predict shifts in genotype frequencies).

      We agree on the importance of considering the scientific questions researchers want to answer in determining the best way to quantify fitness. We have revised both the Introduction (lines 82–88) and the Discussion (lines 615–630) to more clearly explain possible downstream questions researchers may wish to answer with fitness data, and thus why discrepancies in that data based on analysis choices may be important.

      We believe that the text does provide a specific recommendation (second subsection of the Discussion, lines 635–658) for how to quantify relative fitness: using the logit encoding (rather than other encodings), measuring fitness per-cycle (rather than per-generation), and using the wild-type or a phenotypically-equivalent proxy as reference subpopulation to calculate pairwise fitness in a bulk competition (rather than using the mutant library as a whole). This recommendation is based on first principles: the logit encoding is based on the principle of the logistic equation as the null model of relative abundance dynamics (lines 635–637), the choice of the per-cycle timescale is based on the principle that in non-steady state environments the time scale for measuring selection should not depend on the wild-type growth (lines 640–645), and the choice of reference population is based on the principle that a mutant’s fitness should serve as a predictor of its dynamics when arising de novo at low frequency and competing against its wild-type (lines 648–653).
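      The difference between the two reference choices named in this recommendation can be illustrated numerically. The following is a minimal sketch, not the step-by-step guide from the manuscript's Methods: the function names and the toy frequencies are hypothetical, and it assumes that relative abundances are known without sampling error at the start and end of a single growth cycle.

```python
import math


def logit(x):
    """Logit encoding of a relative abundance x, with 0 < x < 1."""
    return math.log(x / (1.0 - x))


def fitness_vs_reference(x_mut_start, x_mut_end, x_ref_start, x_ref_end):
    """Per-cycle change in log(x_mut / x_ref).

    The reference is a chosen subpopulation (for example the wild-type),
    so the other mutants in the bulk culture do not enter the statistic.
    """
    return math.log(x_mut_end / x_ref_end) - math.log(x_mut_start / x_ref_start)


def fitness_vs_population(x_mut_start, x_mut_end):
    """Per-cycle change in logit(x_mut).

    The reference is implicitly everything else in the culture,
    i.e. the wild-type plus all other mutants.
    """
    return logit(x_mut_end) - logit(x_mut_start)


# Toy bulk competition over one growth cycle (hypothetical frequencies):
# wild-type, mutant A, mutant B.
start = {"wt": 0.50, "A": 0.25, "B": 0.25}
end = {"wt": 0.30, "A": 0.50, "B": 0.20}

s_vs_wt = fitness_vs_reference(start["A"], end["A"], start["wt"], end["wt"])
s_vs_pop = fitness_vs_population(start["A"], end["A"])

print(f"mutant A vs wild-type reference: {s_vs_wt:.3f}")
print(f"mutant A vs whole population:    {s_vs_pop:.3f}")
```

      In a two-strain (pairwise) competition the reference abundance is simply one minus the mutant abundance, so the two statistics coincide; in a bulk competition they can differ, as the toy numbers above show.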

      A second limitation is that the authors analyse fitness differences arising solely from resource competition, whereas microbes often interact via other mechanisms, e.g. the production of anticompetitor toxins, cross-feeding of metabolites, or lack of growth to enhance their persistence in stress conditions. Without simulations of these processes, understanding discrepancies among fitness measures is necessarily limited.

      We agree that other interactions are important in many microbial ecosystems and could affect measurements of fitness. We discuss the possibility of these other interactions and their potential consequences for fitness on lines 697–710.

      We focus on resource competition in this paper, however, for two reasons. One is that we are using it as a null model: resource competition is always present, and thus it provides an important baseline for discrepancies in fitness statistics in the absence of any other assumptions. Indeed, our results are that this minimal assumption alone is sufficient to produce a wide range of significant discrepancies, which provides the proof of principle that choices of fitness quantification matter. We have clarified this in a revised explanation of the population dynamics model on lines 294–304.

      The second reason is that fitness measurements of the type discussed in this paper are typically performed on mutants that have only small genetic differences from their ancestor (e.g., a point mutation or gene deletion). While more complex interactions between such similar genotypes are not impossible, we expect them to be rare, in which case resource competition is the only interaction. Explicit modeling of other interactions is an important question for future work, but would require more detailed models and data of those phenomena, and thus would go beyond the scope of the present study. We have added a sentence to explain our emphasis on resource competition on lines 298–301 and 690–697.

      In addition, the analysis of trade-offs between growth traits causing these discrepancies during resource competition seems confounded by biases in measurement error or parameter estimation, at least for growth rate and lag time (Figure 2B), where the replicate estimates for the wildtype show a similar negative correlation.

      The tradeoff between growth traits was only an incidental observation and is not necessary for the fitness statistic discrepancies we analyze in this paper; the only important pattern in the growth traits is the existence of mutants with reduced yields (so as to reduce the wild-type log fold-change in a competition) as well as variation in one other trait under selection (lag time or growth rate in this model). We have clarified this mechanism on lines 328–336, which is demonstrated by Fig. S7. Since these tradeoffs are not relevant to the results and we agree that their significance may be unreliable due to the noisiness of the data, we have removed mention of them.

      Third, the study does not validate relative fitness predictions from growth traits (as is done for the yeast mutants) with measured relative fitness estimates using competition assays, while such data are available, e.g. for the LTEE. This would strengthen their inferences about preferred fitness measures.

      The goal of our modeling with the yeast growth trait data is not to test the ability to predict competition experiments from monoculture data; that has been the focus of previous studies [32, 34, 36, 37]. Rather, we use the population dynamics model for a proof of principle: that the most basic features of microbial population dynamics in laboratory experiments, as captured by this model (resource competition, lag phase, growth phase, saturation), are sufficient to create discrepancies between common fitness statistics used in these experiments (different encodings, time scales, choices of reference subpopulations). The yeast growth curve data merely provides realistic parameters for this model, to ensure we are studying a biologically relevant regime of the dynamics. To avoid this misconception, we have revised our explanation of this model and the data on lines 284–310.

      Fourth, the analysis of epistasis between mutations affecting different growth traits (shown in Figure 3) based on the LTEE data could be better introduced and analysed more comprehensively. Now, the examples given in panels C-F seem rather idiosyncratic and readers may wonder how general these consequences of using fitness estimates based on different time scales are.

      We agree that this analysis was incomplete and missed an opportunity to emphasize this important consequence of fitness quantification. We have thus expanded this analysis into a systematic test of all possible double mutants between qualitative types of trait perturbations in the model. We have added a new main text figure (Fig. 3), new SI figures (Figs. S9–S15), a new subsection in the Results (lines 346–395), and corresponding new sections in the Methods (lines 864–892) and SI (Sec. S8).

      Finally, the study is generally less accessible to experimentalists due to the extensive and principled treatment of specific population dynamic models and fitness inferences. This may distract from the overarching aim to identify fitness measures that are most accurate and useful for predictions of population dynamics and evolutionary processes.

      We appreciate this concern as we do hope to make the paper as broadly accessible as possible, especially to experimentalists who measure microbial fitness. To this end, we have reduced the technical discussion of encodings in the first section of the Results (lines 164–187); revised explanations of the population dynamics model (lines 284–310), importance of growth trait variation (lines 328–336), and epistasis (lines 346–395) to better emphasize the conceptual intuition of these parts; and added a step-by-step guide for our recommended best practices of quantifying fitness in bulk competition experiments (lines 964–982).

      In this light, the motivation for the initial discussion of the importance of how to best encode relative abundance (Figure 1) is unclear. Also, the conclusion, that logit encoding is preferred, because it linearizes logistic growth dynamics and “improves the quality of predictions”, is not further motivated. Experimentalists using non-linear models to infer fitness from growth curves or competition assays may miss the relevance of this discussion.

      The motivation for the discussion of encodings is that it is one of the choices made differently by researchers, mainly using either the logit (more common in experimental evolution and population genetics studies) or log encoding (more common in TnSeq analyses). As such we believe it is important to explain where this choice comes from (a transformation of relative abundance data to make it approximately linear in time, and thus amenable to characterization by a single slope parameter) and why we believe the logit encoding is more logical in most cases. We have streamlined and revised this subsection to make it clearer (lines 164–187).

      Our argument for favoring the logit encoding in most cases is based on the logistic model being a null model for relative abundance dynamics (Sec. S3). In light of the reviewer’s comments, we have realized this may be confusing because there are two common usages of logistic dynamics that are biologically distinct. What we mean by logistic model is the dynamics of relative abundance x of a mutant in competition with other genotypes:

      $$\frac{dx}{dt} = s\,x\,(1 - x) \tag{1}$$

      Here s turns out to be the relative fitness under the logit encoding. On the other hand, researchers also use a logistic ODE to describe the dynamics of absolute abundance N of a single strain in monoculture (e.g., as in a growth curve), with growth rate r and carrying capacity K:

      $$\frac{dN}{dt} = r\,N\left(1 - \frac{N}{K}\right) \tag{2}$$

      We believe the reviewer’s last point refers to Eq. (2), whereas our argument about the logit encoding is based on Eq. (1). We have added a note to clarify this distinction for the reader (lines 192–196).

      Reviewer #3 (Recommendations for the authors):

      In addition to my general comments in the public review, I have several more specific recommendations:

      (1) Line 183-189: unclear why logit-based relative fitness is preferred. Abundance data are not typically binomial.

      We agree this claim about abundance data was incorrect and have removed it. We have revised the section to focus on motivating the logit encoding from logistic dynamics of relative abundance as a null model for most systems (main text lines 175–187 and Sec. S3).

      (2) Line 205: it may be mentioned that s(logit) is the same as the “selection rate constant” often used in microbial studies.

      We have added a sentence clarifying the equivalence of the logit-encoded relative fitness to the selection coefficient in population genetics (lines 188–190).

      (3) Line 368: why do mutations that increase biomass yield also increase WT LFC? Is this, because they grow slower and hence allow the WT more time to grow?

      Mutants with higher yield allow the wild-type to achieve higher log fold-change because those mutants consume fewer resources per cell, which frees up more resources for the wild-type to consume and increase its overall growth. It’s not about growth rate or time, as this would occur even for mutants whose growth rates are identical to the wild-type’s. We have revised our explanation of how variation in growth traits differentially affects fitness statistics (lines 323–340) and epistasis (lines 361–378).
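
      A minimal worked example of this mechanism is sketched below. It assumes a deliberately simplified batch model that is not taken from the manuscript: both strains grow exponentially at the same rate with no lag until a single shared resource runs out, and yield is the number of cells produced per unit of resource. Under those assumptions the common log fold-change is ln(1 + R / (N_wt/Y_wt + N_mut/Y_mut)). All names and numbers are hypothetical.

```python
import math


def wildtype_log_fold_change(resource, n_wt, n_mut, yield_wt, yield_mut):
    """Wild-type log fold-change at resource depletion.

    Assumes equal growth rates and no lag, so both strains share the
    same fold-change F, fixed by the resource balance
        resource = (F - 1) * (n_wt / yield_wt + n_mut / yield_mut).
    """
    # Resource needed per unit of fold-change, summed over both strains.
    demand = n_wt / yield_wt + n_mut / yield_mut
    return math.log(1.0 + resource / demand)


# Same growth rate in both cases; only the mutant's yield differs.
lfc_equal_yield = wildtype_log_fold_change(1000.0, 1.0, 1.0, 1.0, 1.0)
lfc_high_yield = wildtype_log_fold_change(1000.0, 1.0, 1.0, 1.0, 2.0)
```

      In this sketch the higher-yield mutant needs less resource per cell, so the resource lasts longer and the wild-type reaches a larger log fold-change even though neither growth rate has changed.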

      (4) Line 382-386: you may want to cite Ram et al. (2019, 10.1073/pnas.1902217116), who also did such analyses for experimental data from E. coli.

      We have cited this work as Ref. [34].

      (5) Line 415: perhaps use “bulk relative fitness” instead of “total relative fitness”, to contrast with “pairwise relative fitness”.

      We acknowledge the language in this section can be subtle. However, “bulk” is not a sufficient identifier for the concept of total relative fitness as bulk competition experiments (with many genotypes competing simultaneously) can be used to measure either total relative fitness or pairwise relative fitness. (In pairwise competition experiments with only two genotypes, these two types of fitness are identical.) As such we adhere to our original language but have added words to clarify which type of experiment (bulk or pairwise) we are talking about in a given context (e.g., on lines 495–504).
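
      The distinction can be made concrete with a short sketch, assuming the logit encoding and a single growth cycle. Total relative fitness here uses a genotype's abundance relative to the whole population, while pairwise relative fitness uses its abundance relative to the wild-type only. The function names, the "wt" key, and the abundances are hypothetical.

```python
import math


def logit(x):
    """Logit encoding of a relative abundance x in the open interval (0, 1)."""
    return math.log(x / (1.0 - x))


def total_fitness(start, end, genotype):
    """Change in logit abundance relative to the entire population."""
    x0 = start[genotype] / sum(start.values())
    x1 = end[genotype] / sum(end.values())
    return logit(x1) - logit(x0)


def pairwise_fitness(start, end, genotype, reference="wt"):
    """Change in logit abundance relative to the reference genotype only."""
    x0 = start[genotype] / (start[genotype] + start[reference])
    x1 = end[genotype] / (end[genotype] + end[reference])
    return logit(x1) - logit(x0)


# Bulk competition with three genotypes (hypothetical abundances).
bulk_start = {"wt": 100.0, "a": 100.0, "b": 100.0}
bulk_end = {"wt": 800.0, "a": 400.0, "b": 100.0}
bulk_total = total_fitness(bulk_start, bulk_end, "a")
bulk_pairwise = pairwise_fitness(bulk_start, bulk_end, "a")

# Pairwise competition with only two genotypes.
pair_start = {"wt": 100.0, "a": 100.0}
pair_end = {"wt": 800.0, "a": 400.0}
pair_total = total_fitness(pair_start, pair_end, "a")
pair_pairwise = pairwise_fitness(pair_start, pair_end, "a")
```

      In the three-genotype case the two statistics differ because the slow-growing genotype "b" enters the total but not the pairwise calculation; with only two genotypes they coincide.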

      (6) Line 451-453: why does a population in bulk competition consume resources more slowly than in pairwise competitions?

      Mutant libraries used in bulk competition experiments usually include a large number of deleterious mutants, which grow more slowly than the wild-type. Thus these populations typically consume resources more slowly than a population in a pairwise competition would, where a large part of the population is the wild-type.

      (7) Line 565: I don’t understand how one can compare relative fitness to other timescales.

      Relative fitness, as we’ve defined it, has units of rate, since it describes the rate of change of relative abundance (or an encoding of it) over some time scale (e.g., a batch growth cycle or a generation). Therefore it can be compared to other time scales of the system, such as the rate of new mutations arising or the rate of genetic drift fluctuations, as long as they are measured in the same units. This comparison is important to population genetics analyses, such as determining whether the population is in the strong selection-weak mutation limit or the clonal interference regime.

      (8) Line 620 repeats text.

      Thank you, we have revised this paragraph and removed the typo.

      (9) Figure 1C+D: the link between the scenarios on the left and the graphs on the right may be better explained. For example, it may help to make explicit that the 4 scenarios in panel C show the same relative fitness per cycle and that mutant and wildtype have the same growth rate, but different growth periods in both scenarios in panel D. It is also unclear whether the grey dot links to the upper scenario in D.

      We have clarified this issue in the caption and changed the colors to avoid this confusion.

      (10) Figure 2E: it is unclear why “mutants with equal fitness are assigned the lowest rank”.

      This was a technical comment about how to handle ties in our analysis of mutant rankings, but it is moot since no exact ties actually occur in our simulations. We have removed this remark to avoid confusion.

      (11) Figure 2F: the axis labels are confusing, as for the WT estimates no LFC mutant exists. It would also help to make explicit in the legend against which WT replicate/reference strain each strain has competed.

      We agree the inclusion of wild-type replicates in this plot was confusing and unnecessary, so we have removed them. The mutants compete against a wild-type with traits defined by their median values across all wild-type replicates; this is noted in Fig. 2A and the Methods section on our analysis of this data (lines 809–813).

      (12) Figure 5: I am not sure this is needed, as its information is rather limited.

      We agree and have removed this figure.

    1. eLife Assessment

      This is a valuable study presenting solid data indicating that the bacterial GTPases EngA and ObgE enable single-step reconstitution of functional 50S ribosomal subunits under near-physiological conditions. The study elegantly bridges the gap between the non-physiological aspects of the previous two-step reconstitution method and the extract-dependent iSAT system to enable assembly of highly functional ribosomes under translation-compatible conditions. The reported findings represent progress towards achieving a bottom-up reconstruction of the translation machinery from synthetic parts.

    2. Reviewer #1 (Public review):

      Summary:

      This study presents evidence that addition of the two GTPases EngA and ObgE to reactions comprised of rRNAs and total ribosomal proteins purified from native bacterial ribosomes can bypass the requirements for non-physiological temperature shifts and Mg²⁺ ion concentrations for in vitro reconstitution of functional E. coli ribosomes.

      Strengths:

      This advance allows ribosome reconstitution in a fully reconstituted protein synthesis system containing individually purified recombinant translation factors, with the reconstituted ribosomes substituting for native purified ribosomes to support protein synthesis. This represents a significant development in the long-term effort to produce synthetic cells.

      Weaknesses:

      - The authors carried out additional experiments indicating that ~60% of the reconstituted ribosomes are functional and that a significant proportion are capable of synthesizing GFP from the correct initiation codon to the correct stop codon, and also of producing an enzymatically active protein at appreciable levels. Their SDS-PAGE and MS analyses of N-terminally tagged GFP are also quite useful but did not assess the frequency of initiation at the wrong start codon, termination at the incorrect stop codon, or the frequency of frameshifting during elongation. This would require examining additional reporters designed to examine dependence on a Shine-Dalgarno sequence or the impact of an in-frame stop codon to assess the fidelity of initiation and termination events, respectively, and one with a programmed frameshift site to assess the elongation fidelity of their reconstituted ribosomes.

      - Reconstitution studies in the past have succeeded by using all recombinant, individually purified RPs that, if successful here, would have eliminated the possibility that one or more unknown ribosome assembly factors that co-purify with native ribosomes was added to their reconstitution reactions.

    3. Reviewer #2 (Public review):

      This study has developed a single-step method to assemble active bacterial ribosomes under near-physiological conditions by using the GTPase factors EngA and ObgE. These factors eliminate the need for the traditional, harsh manipulations of temperature and magnesium levels. This integration is an important step toward the bottom-up construction of synthetic cells.

      Comments on revisions:

      The authors have addressed my concerns in the previous round of review.

    4. Author response:

      The following is the authors’ response to the original reviews.

      eLife Assessment

      This is a useful study presenting solid data indicating that the bacterial GTPases EngA and ObgE enable single-step reconstitution of functional 50S ribosomal subunits under near-physiological conditions. The study elegantly bridges the gap between the non-physiological aspects of the previous two-step reconstitution method and the extract-dependent iSAT system to enable ribosome assembly under translation-compatible conditions; however, it is limited by reliance on rRNA and proteins extracted from native ribosomes and does not achieve a true bottom-up reconstruction from all synthetic components. The evidence is incomplete in not characterizing the spectrum of reporter polypeptides produced and not comparing their rate and yield of synthesis from reconstituted ribosomes to that obtained with pure native ribosomes; and the impact of the study is limited by not including reporters to examine the fidelity of initiation, elongation or termination achieved with the reconstituted ribosomes.

      As described below, based on the comments from the public reviewers, we have summarized at the end of the Discussion how this study contributes toward true bottom-up reconstruction from fully synthetic components, as well as the aspects that will require further development. In addition, we have newly provided data characterizing the reporter polypeptides from multiple perspectives, demonstrating that the assembled ribosomes do not exhibit issues such as reduced fidelity (Fig. 6, 7, Supplementary Data 2, 3). We believe that these data adequately address the limitations that were pointed out in the eLife Assessment.

      Public Reviews:

      Reviewer #1 (Public review):

      This study presents evidence that the addition of the two GTPases EngA and ObgE to reactions comprised of rRNAs and total ribosomal proteins purified from native bacterial ribosomes can bypass the requirements for non-physiological temperature shifts and Mg²⁺ ion concentrations for in vitro reconstitution of functional E. coli ribosomes.

      Strengths:

      This advance allows ribosome reconstitution in a fully reconstituted protein synthesis system containing individually purified recombinant translation factors, with the reconstituted ribosomes substituting for native purified ribosomes to support protein synthesis. This work potentially represents an important development in the long-term effort to produce synthetic cells.

      Weaknesses:

      While much of the evidence is solid, the analysis is incomplete in certain respects that detract from the scientific quality and significance of the findings:

      (1) The authors do not describe how the native ribosomal proteins (RPs) were purified, and it is unclear whether all subassemblies of RPs have been disrupted in the purification procedure. If not, additional chaperones might be required beyond the two GTPases described here for functional ribosome assembly from individual RPs.

      Native ribosomal proteins (RPs) were prepared from native ribosomes according to the well-established protocol described by Dr. Knud H. Nierhaus [Nierhaus, K. H. Reconstitution of ribosomes in Ribosomes and protein synthesis: A Practical Approach (Spedding G. eds.) 161-189, IRL Press at Oxford University Press, New York (1990)]. In this method, ribosomal proteins are subjected to dialysis in 6 M urea buffer, a strong denaturing condition that may completely disrupt ribosomal structure and dissociate all ribosomal protein subassemblies. To make this point clear, we have described the detailed ribosomal protein (RP) preparation procedure in the manuscript, rather than merely referring to the book.

      In addition, we would like to clarify one point related to this comment. The focus of the present study is to show that the presence of two factors is required for single-step ribosome reconstitution under translation-compatible, cell-free conditions. We do not intend to claim that these two factors are absolutely sufficient for ribosome reconstitution. Hence, we have revised the manuscript to more explicitly state what this work does and does not conclude.

      (2) Reconstitution studies in the past have succeeded by using all recombinant, individually purified RPs, which would clearly address the issue in the preceding comment and also eliminate the possibility that an unknown ribosome assembly factor that co-purifies with native ribosomes has been added to the reconstitution reactions along with the RPs.

      As noted in our response to Comment (1), the focus of the present study is the requirement of the two factors for functional ribosome assembly. Therefore, we consider that it is not necessary to completely exclude the possibility that unknown ribosome assembly factors are present in the RP preparation. Nevertheless, we agree that it is important to clarify what factors, if any, are co-present in the RP fraction. To address this, we performed proteomic analysis of the TP70 preparation (Supplementary Data 3) and noted the possibility that other factors are included.

      We also agree that additional, as-yet-unidentified components, including factors involved in rRNA modification, could plausibly further improve assembly efficiency. Such studies may also help extend the system to the use of in vitro-transcribed rRNA and fully recombinant ribosomal proteins, which would essentially be the next step of this study. We have noted the possibility of as-yet-unidentified components and these future perspectives in the Discussion.

      (3) They never compared the efficiency of the reconstituted ribosomes to native ribosomes added to the "PURE" in vitro protein synthesis system, making it unclear what proportion of the reconstituted ribosomes are functional, and how protein yield per mRNA molecule compares to that given by the PURE system programmed with purified native ribosomes.

      Following this suggestion, we measured the sfGFP synthesis rate from the increase in fluorescence over time under conditions where the template mRNA is in excess, and compared this rate directly between reconstituted and native ribosomes. We consider that this comparison provides insight into what fraction of ribosomes reconstituted in our system are functionally active (Fig. 6).
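
      The kind of comparison described here can be sketched as follows. This is not the authors' analysis code; it assumes that both reactions contain the same total ribosome concentration, that fluorescence is read during the linear phase, and that the synthesis rate is proportional to the number of active ribosomes. The function names and the fluorescence readings are hypothetical.

```python
def slope(times, values):
    """Ordinary least-squares slope of values against times."""
    n = len(times)
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    numerator = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    denominator = sum((t - mean_t) ** 2 for t in times)
    return numerator / denominator


def active_fraction(times, fluor_reconstituted, fluor_native):
    """Ratio of fluorescence increase rates, reconstituted over native."""
    return slope(times, fluor_reconstituted) / slope(times, fluor_native)


# Made-up readings (arbitrary fluorescence units, time in minutes).
times = [0.0, 10.0, 20.0, 30.0]
native = [5.0, 105.0, 205.0, 305.0]
reconstituted = [5.0, 55.0, 105.0, 155.0]
fraction = active_fraction(times, reconstituted, native)
```

      With these illustrative readings the reconstituted reaction accumulates fluorescence at half the native rate, which under the stated assumptions would correspond to about half of the reconstituted ribosomes being active.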

      As noted in the provisional responses, quantifying protein yield per mRNA molecule is substantially more challenging. The translation system is complex, and the apparent yield per mRNA can vary depending on factors such as differences in polysome formation efficiency. In addition, the PURE system is a coupled transcription–translation setup that starts from DNA templates, which further complicates rigorous normalization on a per-mRNA basis. Because the main focus of this study is to determine how many functionally active ribosomes can be reconstituted under translation-compatible conditions, we addressed this comment solely by comparing sfGFP synthesis rates.

      (4) They also have not examined the synthesized GFP protein by SDS-PAGE to determine what proportion is full-length.

      We added an affinity tag to the sfGFP reporter, purified the synthesized product from the reaction mixture, and analyzed it by SDS–PAGE (Fig. 7a).

      (5) The previous development of the PURE system included examinations of the synthesis of multiple proteins, one of which was an enzyme whose specific activity could be compared to that of the native enzyme. This would be a significant improvement to the current study. They could also have programmed the translation reactions containing reconstituted ribosomes with (i) total native mRNA and compared the products in SDS-PAGE to those obtained with the control PURE system containing native ribosomes; (ii) with specific reporter mRNAs designed to examine dependence on a Shine-Dalgarno sequence and the impact of an in-frame stop codon in prematurely terminating translation to assess the fidelity of initiation and termination events; and (iii) an mRNA with a programmed frameshift site to assess elongation fidelity displayed by their reconstituted ribosomes.

      Following the recommendation, we selected DHFR as an enzymatically active protein and used it as a reporter, confirming that it exhibited enzymatic activity comparable to that observed when synthesized by native ribosomes (Fig. 7c). In addition, MS analysis of the purified sfGFP used for SDS-PAGE analysis showed that nearly all peptide fragments were detected, covering almost the entire sequence from the initiator amino acid to the amino acid immediately preceding the stop codon (Fig. 7b, Supplementary Data 2). These results suggest that protein synthesis by the newly assembled ribosomes proceeds smoothly from initiation to termination, with no apparent problem in fidelity, and therefore indicate that functional ribosomes were successfully reconstituted.

      Reviewer #2 (Public review):

      This study presents a significant advance in the field of in vitro ribosome assembly by demonstrating that the bacterial GTPases EngA and ObgE enable single-step reconstitution of functional 50S ribosomal subunits under near-physiological conditions, specifically at 37 °C and with total Mg²⁺ concentrations below 10 mM.

      This achievement directly addresses a long-standing limitation of the traditional two-step in vitro assembly protocol (Nierhaus & Dohme, PNAS 1974), which requires non-physiological temperatures (44-50 °C) and high Mg²⁺ concentrations (~20 mM). Inspired by the integrated Synthesis, Assembly, and Translation (iSAT) platform (Jewett et al., Mol Syst Biol 2013), which leverages E. coli S150 crude extract to supply essential assembly factors, the authors hypothesize that specific ribosome biogenesis factors, particularly GTPases present in such extracts, may be responsible for enabling assembly under mild conditions. Through systematic screening, they identify EngA and ObgE as the minimal pair sufficient to replace the need for temperature and Mg²⁺ shifts when using phenol-extracted (i.e., mature, modified) rRNA and purified TP70 proteins.

      However, several important concerns remain:

      (1) Dependence on Native rRNA Limits Generalizability

      The current system relies on rRNA extracted from native ribosomes via phenol, which retains natural post-transcriptional modifications. As the authors note (lines 302-304), attempts to assemble active 50S subunits using in vitro transcribed rRNA, even in the presence of EngA and ObgE, failed. This contrasts with iSAT, where in vitro transcribed rRNA can yield functional (though reduced-activity, ~20% of native) ribosomes, presumably due to the presence of rRNA modification enzymes and additional chaperones in the S150 extract. Thus, while this study successfully isolates two key GTPase factors that mimic part of iSAT's functionality, it does not fully recapitulate iSAT's capacity for de novo assembly from unmodified RNA. The manuscript should clarify that the in vitro assembly demonstrated here is contingent on using native rRNA and does not yet achieve true bottom-up reconstruction from synthetic parts. Moreover, given iSAT's success with transcribed rRNA, could a similar systematic omission approach (e.g., adding individual factors) help identify the additional components required to support unmodified rRNA folding?

      We fully recognize the reviewer’s point that our current system has not yet achieved a true bottom-up reconstruction. Although we intended to state this clearly in the manuscript, the fact that this concern remains indicates that our description was not sufficiently explicit. We therefore added a paragraph to the Discussion to ensure that this limitation is clearly communicated to readers.

      (2) Imprecise Use of "Physiological Mg²⁺ Concentration"

      The abstract states that assembly occurs at "physiological Mg²⁺ concentration" (<10 mM). However, while this total Mg²⁺ level aligns with optimized in vitro translation buffers (e.g., in PURE or iSAT systems), it exceeds estimates of free cytosolic [Mg²⁺] in E. coli (~1-2 mM). The authors should clarify that they refer to total Mg²⁺ concentrations compatible with cell-free protein synthesis, not necessarily intracellular free ion levels, to avoid misleading readers about true physiological relevance.

      We agree that this is a very reasonable point and revised the manuscript to clarify that we are referring to the total Mg²⁺ concentration compatible with cell-free protein synthesis, rather than the intracellular free Mg²⁺ level under physiological conditions. We also changed the term “physiological” to “near-physiological” to avoid misunderstanding.

      In summary, this work elegantly bridges the gap between the two-step method and the extract-dependent iSAT system by identifying two defined GTPases that capture a core functionality of cellular extracts: enabling ribosome assembly under translation-compatible conditions. However, the reliance on native rRNA underscores that additional factors - likely present in iSAT's S150 extract - are still needed for full de novo reconstitution from unmodified transcripts. Future work combining the precision of this defined system with the completeness of iSAT may ultimately realize truly autonomous synthetic ribosome biogenesis.

      Recommendations for the authors:

      Reviewing Editor Comments:

      Recommendations for improvement:

      (1) Assess the length distribution of GFP polypeptides being produced using SDS-PAGE.

      SDS-PAGE was performed in response to Comment 4 of Reviewer #1 (Fig. 7a). Please refer to our response to that comment.

      (2) Compare the rate and yield of GFP synthesized per mRNA using their reconstituted ribosomes to that obtained with pure native ribosomes.

      The efficiency of the reconstituted ribosomes was compared with that of native ribosomes in response to Comment 3 of Reviewer #1 (Fig. 6). Please refer to our response to that comment.

      (3) Expand the panel of reporter mRNAs being examined to compare the fidelity of initiation, elongation or termination achieved with reconstituted ribosomes to that obtained using native ribosomes.

      DHFR synthesis was examined and MS analysis of the synthesized sfGFP was performed in response to Comment 5 of Reviewer #1 (Fig. 7b, c). Please refer to our response to that comment.

      (4) Revise the manuscript to clarify that the in vitro assembly demonstrated here is contingent on using native rRNA and thus does not achieve a true bottom-up reconstruction from synthetic parts.

      We added to the Discussion a paragraph summarizing the findings of this study, its limitations, and future perspectives, in response to Comments 1 and 2 of Reviewer #1 and Comment 1 of Reviewer #2. Please refer to our responses to these comments.

      (5) Revise the manuscript to clarify that they are referring to total Mg2+ concentrations compatible with cell-free protein synthesis, not necessarily intracellular free ion levels, to avoid misleading readers about the physiological relevance of the reconstitution.

      We revised the manuscript to clarify this point in response to comment 2 of Reviewer #2. Please refer to our response to that comment.

      (6) Revise the text to fully describe how the native ribosomal proteins (RPs) were purified and indicate whether all subassemblies of RPs were disrupted in the purification procedure.

      In response to comment 1 of Reviewer #1, we revised the Methods section to clarify how the native RPs were purified and that all subassemblies of RPs were disrupted.

      (7) Revise the text to indicate that achieving ribosome reconstitutions using all recombinant, individually purified RPs is required to achieve a true bottom-up reconstruction from all synthetic components.

      As in our response to comment 4, we have added this point at the end of the Discussion as a future perspective toward true bottom-up reconstruction from all synthetic components.

      (8) Consider conducting a similar systematic omission approach (e.g., adding individual factors) to help identify the additional components required to support unmodified rRNA folding.

      As in our responses to comments 4 and 7, we have added this point at the end of the Discussion as a future perspective toward identifying the additional factors essential for true bottom-up reconstruction.

      Reviewer #1 (Recommendations for the authors):

      (1) Assessing the spectrum of GFP polypeptides being produced by SDS-PAGE and comparing the rate and yield of GFP produced to that obtained with pure native ribosomes would seem to be essential additional measurements needed to bolster the evidence supporting the main conclusions of the work.

      SDS-PAGE and MS analysis of the synthesized sfGFP were performed (Fig. 7a, b). A comparison of the assembled ribosomes with native ones was also performed (Fig. 6).

      (2) Examining translation of other reporter mRNAs designed to compare the fidelity of initiation, elongation or termination achieved with reconstituted ribosomes to that produced by native ribosomes in the PURE system would be required to elevate the scientific quality of the work and its significance to the field.

      DHFR synthesis and its activity measurement were performed (Fig. 7c). Also, MS analysis of the purified sfGFP showed that nearly all peptide fragments were detected, covering almost the entire sequence from the initiator amino acid to the amino acid immediately preceding the stop codon (Fig. 7b). We consider that these findings indicate that there is no apparent problem with fidelity.

    1. eLife Assessment

      This is an important study that develops multiple human iPSC-based models to study the consequences of DNMT3A mutations in Tatton-Brown-Rahman Syndrome. Convincing evidence shows dysregulation of GABAergic interneuron development and function, and the authors identify some of the key signaling mechanisms underlying these changes. This study will be of interest for understanding the functions of DNMT3A in brain development and the causes of neurological dysfunction in Tatton-Brown-Rahman Syndrome.

    2. Reviewer #1 (Public review):

      Summary:

      This is an important study that describes the consequences of the DNMT3A mutation in human neuronal development for the first time. The selective impact of DNMT3A function on GABAergic interneurons is interesting and an important consideration for future therapeutics. The claims made in the manuscript are supported by strong evidence for the most part, and the data are generally of high quality and well presented.

      Strengths:

      The strengths of the work include the characterization of multiple DNMT3A loss-of-function alleles, including two missense variants (R882H and P904L) and a deletion allele. The missense mutation lines both include an ideal control with the same genetic background. A CRISPRi-mediated DNMT3A knockdown has also been included. The study identifies the mTOR-PI3K pathway as a factor in the overgrowth found in the mutant organoid. Bulk mRNA sequencing and whole-genome bisulfite sequencing identify hypomethylated genomic regions associated with gene expression repression. Again, this is more pronounced in the ventral organoid compared to the dorsal organoid. In addition, the extensive electrophysiological characterizations with a high-density microelectrode array support the more mature status of mutant interneurons.

      Weaknesses:

      Although a strong study overall, some weaknesses are noted. These include:

      (1) The lack of validation data for the generated iPSCs and hESCs, such as the chromosomal contents, ploidy, and pluripotency states.

      (2) Other weaknesses relate to data interpretation and insufficient discussion of related matters, as detailed in the recommendations to the authors.

      (3) Also, some errors are noted and detailed in the recommendation section.

    3. Reviewer #2 (Public review):

      Summary:

      Chapman, Determan et al. investigate how pathogenic mutations in DNMT3A, which cause Tatton-Brown-Rahman Syndrome (TBRS), disrupt human cortical developmental processes using a comprehensive panel of human pluripotent stem cell models spanning DNMT3A loss-of-function severity. The authors aim to identify the cellular and molecular mechanisms underlying TBRS-associated brain overgrowth and intellectual disability, and to test whether mechanistic convergence exists between TBRS and other overgrowth-intellectual disability disorders (OGIDs) caused by mutations in EZH2 (Weaver syndrome) or PIK3CA pathway components. Their central conclusion is that GABAergic interneuron development is selectively vulnerable to DNMT3A mutation, where reduced DNA methylation causes premature de-repression of neuronal and synaptic genes, driving precocious neuronal maturation and hyperactivity sufficient to disrupt neuronal network synchrony. This report adds to a growing literature supporting the vulnerability of GABAergic interneurons in NDDs and further provides a mechanistic view of this vulnerability, potentially convergent across OGIDs. The mechanistic claims around H3K27me3 compensation and mTOR-based therapeutic convergence, while promising, rest on more preliminary evidence and would benefit from the distinction between correlation and mechanism being made more explicit in the text. Overall, this is a compelling study with a rigorous experimental design and novel findings with a potential impact on a better understanding of the OGID pathophysiology.

      Strengths:

      (1) A major strength of this work is the breadth and rigor of the disease modeling approach. Four independent TBRS model systems are used in tandem: a patient-derived iPSC line with isogenic CRISPR-corrected control (R882H), a knock-in hESC model (P904L) with its wild-type isogenic, patient deletion iPSC lines (Del1/2), and CRISPRi knockdown models (G1/G2), collectively spanning a range of DNMT3A loss-of-function that correlates with phenotypic severity. This allelic series design substantially strengthens causal inference beyond what any single isogenic pair could provide.

      (2) The multi-omic integration across matched developmental stages provides a strong mechanistic foundation for the cellular phenotyping and significantly enhances the study's novelty. RNA-seq, whole-genome bisulfite sequencing, and H3K27me3 CUT&Tag are combined in the same cell types and timepoints, and show that DNMT3A loss reduces CG methylation at neuronal and synaptic gene loci, leading to premature transcriptional activation.

      (3) The selective vulnerability of ventral (GABAergic) versus dorsal (glutamatergic) progenitors is one of the study's most important findings. This lineage specificity is consistently observed across all model systems and in both 2D and organoid formats, where ventral NPCs show increased proliferation, premature neuronal gene expression, and increased neurogenesis, while dorsal NPCs are largely unaffected at the transcriptomic and cellular level despite exhibiting comparable DNA methylation changes. This adds to a body of emerging work showing GABAergic interneuron vulnerability in NDDs where ubiquitously expressed genes such as chromatin modifiers are perturbed, and provides additional molecular insights into potential mechanisms of "resilience" of dorsal populations.

      (4) The functional characterization follows a logical progression from single-neuron electrophysiology (demonstrating GABAergic hyperactivity with increased action potential amplitude and firing rate) to network-level analysis using high-density multi-electrode arrays. The HD-MEA experimental design - pairing TBRS or control GABAergic neurons with a constant background of control iGlut neurons - cleanly isolates GABAergic dysfunction as the driver of network hypersynchrony.

      Weaknesses:

      (1) The concomitant induction of proliferation and differentiation in TBRS V-NPCs is conceptually striking, since these are generally considered antagonistic developmental programs. The authors partially address this tension by noting that DNMT3A LOF alone is insufficient to initiate neuronal differentiation, i.e., V-NPCs upregulate neuronal and synaptic genes while retaining progenitor identity, implying that transcriptomic priming and commitment to differentiation are decoupled. However, the relationship between the proliferative phenotype and the epigenetic priming phenotype remains mechanistically unresolved. The manuscript documents mTOR pathway upregulation at the protein level and identifies shared DEGs that include proliferative regulators, but it does not establish whether mTOR-driven proliferation and mCG-loss-driven neuronal gene de-repression/enhanced differentiation are causally linked or represent two independent consequences of DNMT3A LOF.

      (2) Relatedly, the rapamycin rescue experiment is a valuable proof-of-concept for the PIK3/AKT/mTOR convergence but is limited to a single dose in a single model (882) with a single readout (Ki67+ proliferation). Given the prominence of mTOR pathway convergence in the manuscript as a potential shared therapeutic avenue across OGIDs, the data supporting this claim are somewhat preliminary. It remains unknown whether mTOR inhibition rescues downstream phenotypes (neurogenesis, gene expression, neuronal maturation) or whether less severe TBRS models respond similarly. This might also help tackle the first comment above; for example, if mTOR inhibition rescued proliferation but not the transcriptomic priming, that would support two independent mechanisms.

      (3) The claim that H3K27me3 compensates for mCG loss is an important mechanistic point, but the current data do not distinguish between active compensation, in which EZH2 is recruited in response to methylation loss, and functional redundancy, in which H3K27me3 is independently established and becomes the dominant repressive mark once DNA methylation is reduced. The EZH2 knockdown/inhibition experiments show that H3K27me3 is sufficient to maintain repression at hypo-DMR sites, but they do not establish that H3K27me3 gain is itself a response to methylation loss. Because H3K27me3 profiling was performed only in the severe 882 model, it is also unclear whether H3K27me3 gain scales with DNMT3A LOF severity, as a compensatory model would predict. Finally, the EZH2 overexpression rescue is performed in V-NPCs, whereas the compensation model is developed primarily in D-NPCs, making it difficult to assess whether the same mechanism operates in the lineage where it was originally inferred.

      (4) The narrative framing of dorsal neuron development as unaffected by DNMT3A LOF is somewhat at odds with the data presented. The 882 D-NPCs show substantial DNA methylation changes, and TBRS D-INs exhibit what the authors describe as "substantive transcriptomic differences" involving persistent expression of pluripotency and progenitor genes, which seems to be a distinct but potentially significant phenotype. The impact of DNMT3A loss between ventral and dorsal lineages might be more accurately framed as divergent in nature rather than specific to a certain population.

      (5) SST stainings are not entirely convincing. They appear mostly nuclear, and in some instances localize to rosettes in organoids, whereas the protein is largely confined to processes and is expected to be found outside progenitor-rich zones like rosettes.

    4. Reviewer #3 (Public review):

      Summary:

      In this manuscript, the authors investigated TBRS etiology by using new human pluripotent stem cell models, modeling varying levels of TBRS-associated loss of DNMT3A function. They identified increased lineage-specific proliferation of precursors in TBRS ventral MGE-like progenitors, which they propose was related to increased signaling through the PIK3/AKT/mTOR pathway. Furthermore, they show that reduced DNA methylation during MGE-like progenitor differentiation into GABAergic interneurons can cause a premature expression of neuronal and synaptic genes, triggering precocious neuronal maturation. In conclusion, they propose that TBRS-derived GABAergic neurons exhibit hyperactivity that can alter the development and structure of neuronal networks.

      Strengths:

      Overall, the data presented is convincing, from an early developmental point of view, given that the iPSC-derived 2D cultures or organoids used do not get to reach a mature state. Nonetheless, the data clearly show the effects that deleterious mutations in TBRS can cause during the period of neurogenesis, which was missing in the field.

      Weaknesses:

      (1) Li et al., 2022 (referred to in the manuscript) seems to already show the interplay between H3K27me3 and Dnmt3a discussed in this study, i.e., that in the absence of DNA methylation, there is an expansion of polycomb-like repression. These data should be better acknowledged in the paragraph 'Repressive H3K27me3 compensates for severe loss of DNA methylation' (page 9), given that they support the data presented in this manuscript and suggest a common mechanism in the interplay between these two repressive marks, as is well established in the literature.

      (2) The authors should acknowledge that the omics data come from a mixed population of cells.

      (3) The authors are encouraged to further discuss whether the overgrowth observed in ventral GABAergic cultures or organoids compares to the overgrowth observed in diseased patients. One expects MRIs to have been performed in patients and that these could be harnessed to discern if overgrowth occurs in the cortex or ventral regions of the brain.

    5. Author response:

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      This is an important study that describes the consequences of the DNMT3A mutation in human neuronal development for the first time. The selective impact of DNMT3A function on GABAergic interneurons is interesting and an important consideration for future therapeutics. The claims made in the manuscript are supported by strong evidence for the most part, and the data are generally of high quality and well presented.

      Strengths:

      The strengths of the work include the characterization of multiple DNMT3A loss-of-function alleles, including two missense variants (R882H and P904L) and a deletion allele. The missense mutation lines both include an ideal control with the same genetic background. A CRISPRi-mediated DNMT3A knockdown has also been included. The study identifies the mTOR-PI3K pathway as a factor in the overgrowth found in the mutant organoid. Bulk mRNA sequencing and whole-genome bisulfite sequencing identify hypomethylated genomic regions associated with gene expression repression. Again, this is more pronounced in the ventral organoid compared to the dorsal organoid. In addition, the extensive electrophysiological characterizations with a high-density microelectrode array support the more mature status of mutant interneurons.

      Weaknesses:

      Although a strong study overall, some weaknesses are noted. These include:

      (1) The lack of validation data for the generated iPSCs and hESCs, such as the chromosomal contents, ploidy, and pluripotency states.

      We thank the reviewer for their constructive feedback. We previously validated our 882 models with whole genome sequencing and teratoma formation upon mouse fat pad injection, while the parental human embryonic stem cell line (WA01 hESCs) used for P904L variant knock-in was validated by our Genome Engineering Stem Cell (GESC) core upon derivation of that variant knock-in model. We have now added both karyotyping and pluripotency staining (SOX2/OCT4) for all other hPSC lines as (new) Supplementary Figure S17 and included further description in our Methods section under “hPSC Model Generation and Culture”.

      New Data: Supplemental Figure S17 (SOX2/OCT4 staining in hPSCs and karyotyping of all lines used)

      Text edits: Additional language confirming hPSC line validation will be added to the Methods section under “hPSC Model Generation and Culture” on page 18.

      (2) Other weaknesses relate to data interpretation and insufficient discussion of related matters, as detailed in the recommendations to the authors.

      We thank the reviewer for their insightful suggestions and have detailed our responses in the “recommendations to the authors” section.

      (3) Also, some errors are noted and detailed in the recommendation section.

      We thank the reviewer for catching these errors and have since corrected them, with detailed responses below.

      Reviewer #2 (Public review):

      Summary:

      Chapman, Determan et al. investigate how pathogenic mutations in DNMT3A, which cause Tatton-Brown-Rahman Syndrome (TBRS), disrupt human cortical developmental processes using a comprehensive panel of human pluripotent stem cell models spanning DNMT3A loss-of-function severity. The authors aim to identify the cellular and molecular mechanisms underlying TBRS-associated brain overgrowth and intellectual disability, and to test whether mechanistic convergence exists between TBRS and other overgrowth-intellectual disability disorders (OGIDs) caused by mutations in EZH2 (Weaver syndrome) or PIK3CA pathway components. Their central conclusion is that GABAergic interneuron development is selectively vulnerable to DNMT3A mutation, where reduced DNA methylation causes premature de-repression of neuronal and synaptic genes, driving precocious neuronal maturation and hyperactivity sufficient to disrupt neuronal network synchrony. This report adds to a growing literature supporting the vulnerability of GABAergic interneurons in NDDs and further provides a mechanistic view of this vulnerability, potentially convergent across OGIDs. The mechanistic claims around H3K27me3 compensation and mTOR-based therapeutic convergence, while promising, rest on more preliminary evidence and would benefit from the distinction between correlation and mechanism being made more explicit in the text. Overall, this is a compelling study with a rigorous experimental design and novel findings with a potential impact on a better understanding of the OGID pathophysiology.

      Strengths:

      (1) A major strength of this work is the breadth and rigor of the disease modeling approach. Four independent TBRS model systems are used in tandem: a patient-derived iPSC line with isogenic CRISPR-corrected control (R882H), a knock-in hESC model (P904L) with its wild-type isogenic, patient deletion iPSC lines (Del1/2), and CRISPRi knockdown models (G1/G2), collectively spanning a range of DNMT3A loss-of-function that correlates with phenotypic severity. This allelic series design substantially strengthens causal inference beyond what any single isogenic pair could provide.

      (2) The multi-omic integration across matched developmental stages provides a strong mechanistic foundation for the cellular phenotyping and significantly enhances the study's novelty. RNA-seq, whole-genome bisulfite sequencing, and H3K27me3 CUT&Tag are combined in the same cell types and timepoints, and show that DNMT3A loss reduces CG methylation at neuronal and synaptic gene loci, leading to premature transcriptional activation.

      (3) The selective vulnerability of ventral (GABAergic) versus dorsal (glutamatergic) progenitors is one of the study's most important findings. This lineage specificity is consistently observed across all model systems and in both 2D and organoid formats, where ventral NPCs show increased proliferation, premature neuronal gene expression, and increased neurogenesis, while dorsal NPCs are largely unaffected at the transcriptomic and cellular level despite exhibiting comparable DNA methylation changes. This adds to a body of emerging work showing GABAergic interneuron vulnerability in NDDs where ubiquitously expressed genes such as chromatin modifiers are perturbed, and provides additional molecular insights into potential mechanisms of "resilience" of dorsal populations.

      (4) The functional characterization follows a logical progression from single-neuron electrophysiology (demonstrating GABAergic hyperactivity with increased action potential amplitude and firing rate) to network-level analysis using high-density multi-electrode arrays. The HD-MEA experimental design - pairing TBRS or control GABAergic neurons with a constant background of control iGlut neurons - cleanly isolates GABAergic dysfunction as the driver of network hypersynchrony.

      Weaknesses:

      (1) The concomitant induction of proliferation and differentiation in TBRS V-NPCs is conceptually striking, since these are generally considered antagonistic developmental programs. The authors partially address this tension by noting that DNMT3A LOF alone is insufficient to initiate neuronal differentiation, i.e., V-NPCs upregulate neuronal and synaptic genes while retaining progenitor identity, implying that transcriptomic priming and commitment to differentiation are decoupled. However, the relationship between the proliferative phenotype and the epigenetic priming phenotype remains mechanistically unresolved. The manuscript documents mTOR pathway upregulation at the protein level and identifies shared DEGs that include proliferative regulators, but it does not establish whether mTOR-driven proliferation and mCG-loss-driven neuronal gene de-repression/enhanced differentiation are causally linked or represent two independent consequences of DNMT3A LOF.

      We thank the reviewer for their comment and agree that this phenotype, whereby progenitors exhibited both increased proliferation and hallmarks of gene expression associated with neuronal differentiation, is striking and interesting, given that these are typically antagonistic paradigms during normal development.

      We documented that these phenotypes involve upregulated expression of both neuronal/synaptic and proliferative genes in V-NPCs (Figure 2d), with concomitant loss of repressive DNA methylation at regulatory elements associated with these genes (Figure 2f, Supplemental Data 5). In this work, DNMT3A mutation had a more prominent role in de-repressing neuronal and synaptic gene expression to promote hallmarks of neuron differentiation, while playing a relatively less central role in direct regulation of proliferation genes, as seen from the relative prominence of neuronal/synaptic- versus proliferation-related GO terms in our Supplemental Data 5 table.

      To examine the mechanisms underlying increased V-NPC proliferation in our TBRS models, we assessed a potential relationship with the PIK3/AKT/mTOR pathway, as this is implicated in increased proliferation resulting from DNMT3A-associated mutation in myeloid leukemia (Dai et al., 2017, PMID: 28461508). In our work, DNMT3A mutation increased the expression and/or phosphorylation of mTOR signaling pathway targets specifically in V-NPCs (Figure 1q-r, Supplemental Figure S3a-d). However, while TBRS mutation directly affected repressive DNA methylation at a suite of cell proliferation-related genes, these did not include the PIK3/AKT/mTOR pathway genes themselves, suggesting an indirect relationship between altered DNA methylation and increased mTOR signaling.

      Text Edits: We will incorporate further discussion of how DNMT3A-mediated gene repression and levels of PIK3/AKT/mTOR pathway signaling may be interacting, providing a framework for future studies to identify how these related OGID gene mutations may converge mechanistically.

      (2) Relatedly, the rapamycin rescue experiment is a valuable proof-of-concept for the PIK3/AKT/mTOR convergence but is limited to a single dose in a single model (882) with a single readout (Ki67+ proliferation). Given the prominence of mTOR pathway convergence in the manuscript as a potential shared therapeutic avenue across OGIDs, the data supporting this claim are somewhat preliminary. It remains unknown whether mTOR inhibition rescues downstream phenotypes (neurogenesis, gene expression, neuronal maturation) or whether less severe TBRS models respond similarly. This might also help tackle the first comment above; for example, if mTOR inhibition rescued proliferation but not the transcriptomic priming, that would support two independent mechanisms.

      We thank the reviewer for their comment. We explored both the overall levels and phosphorylation of proteins involved in PIK3/AKT/mTOR signaling in the 882, 904, Del1, Del2, and KO V-NPC models (Figure 1q-r, Supplementary Figure S3a-d), finding specific increases of all proteins. We showed that rapamycin addition reversed the increased proportion of KI67+ proliferating cell nuclei resulting from 882 mutation in V-NPCs in main Figure 1s, while demonstrating that rapamycin also reduced the proportion of KI67+ nuclei observed in both less severe 904 and Del1 V-NPC models (Supplementary Figure S3e-f).

      We agree that understanding whether rapamycin treatment can rescue TBRS neuronal phenotypes would be very interesting, as previous work on Tuberous Sclerosis Complex has utilized rapamycin and other mTOR inhibitors to effectively reverse TSC-related alterations of neuronal morphology and neuronal hyperexcitability (Buttermore et al., 2025, PMID: 40792287). Future studies examining convergent mechanisms and therapeutics for OGIDs should examine how similarly targeting this and related pathways rescues altered neuronal morphology, maturation, and function, as we have demonstrated that TBRS mutation has subsequent consequences for V-IN differentiation, maturation, and function. This point has been detailed in the discussion section on pages 15-16.

      (3) The claim that H3K27me3 compensates for mCG loss is an important mechanistic point, but the current data do not distinguish between active compensation, in which EZH2 is recruited in response to methylation loss, and functional redundancy, in which H3K27me3 is independently established and becomes the dominant repressive mark once DNA methylation is reduced. The EZH2 knockdown/inhibition experiments show that H3K27me3 is sufficient to maintain repression at hypo-DMR sites, but they do not establish that H3K27me3 gain is itself a response to methylation loss. Because H3K27me3 profiling was performed only in the severe 882 model, it is also unclear whether H3K27me3 gain scales with DNMT3A LOF severity, as a compensatory model would predict. Finally, the EZH2 overexpression rescue is performed in V-NPCs, whereas the compensation model is developed primarily in D-NPCs, making it difficult to assess whether the same mechanism operates in the lineage where it was originally inferred.

      We thank the reviewer for the opportunity to clarify our findings and experimental reasoning. A previous study using a conditional Dnmt3a knockout mouse model (Li et al., 2022, PMID: 35604009) demonstrated increased expression of multiple PRC2 components following the loss of Dnmt3a. This study demonstrated that sites which lost DNA methylation gained H3K27me3 in postnatal neurons upon Dnmt3a loss. Therefore, we hypothesize that the gain of H3K27me3 likely occurs in response to loss of DNMT3A methylation.

      While we did not perform CUT&Tag for H3K27me3 in our less severe models, we did validate gene expression changes following EZH2 knockdown and inhibition in both the R882H (Figure 4g-h) and P904L (Supplementary Figure S8b) models, finding that gene expression was unchanged in the model with the less severe DNMT3A mutation (P904L). Based upon these findings, we hypothesized that compensatory H3K27me3 may occur only upon severe DNMT3A loss, as seen in the dominant-negative R882H model. Furthermore, as H3K27me3 compensation was more prominent in D-NPCs, we hypothesized that this might be sufficient to prevent de-repression and aberrant neuronal gene expression upon loss of DNMT3A-mediated repression in D-NPCs. However, since TBRS mutation caused the most prominent de-repression of neuronal gene expression in V-NPCs, we also tested whether EZH2 overexpression could reverse this, finding that it partially suppressed this dysregulated neuronal gene expression. To better clarify this logic and the findings, we will make text edits to this results section.

      Text edits: We will clarify the reasoning for performing the EZH2 overexpression experiments in V-NPCs and reference Li et al., 2022 in both the results (pg. 9-10) and discussion.

      (4) The narrative framing of dorsal neuron development as unaffected by DNMT3A LOF is somewhat at odds with the data presented. The 882 D-NPCs show substantial DNA methylation changes, and TBRS D-INs exhibit what the authors describe as "substantive transcriptomic differences" involving persistent expression of pluripotency and progenitor genes, which seems to be a distinct but potentially significant phenotype. The impact of DNMT3A loss between ventral and dorsal lineages might be more accurately framed as divergent in nature rather than specific to a certain population.

      We thank the reviewer for their comment. While TBRS mutations appear to have a significantly stronger effect on V-NPCs and subsequently V-INs, both transcriptomic and methylation alterations do also occur upon TBRS mutation in D-NPCs and D-INs, as noted in Supplemental Figure S4d, S11, and Supplemental Data 2. However, we observed substantially greater molecular alterations in V-NPCs/V-INs, a lack of overt cellular phenotypes in D-NPCs where assayed, and a lack of functional consequences in matured D-INs, suggesting a more significant requirement for DNMT3A in regulating the differentiation and subsequent maturation of cortical inhibitory interneurons during embryonic and early pre-natal development, the developmental periods that we can readily model in hPSC-derived neurons.

      It should also be noted that these hPSC differentiation models do not recapitulate post-natal deposition of non-CpG (mCA) DNA methylation, a mechanism disrupted postnatally by TBRS-associated mutations in our prior work in murine models (Harrison Gabel; e.g. Beard et al., 2023, PMID: 37952155). Therefore, we hypothesize that if we could sufficiently mature D-INs to a state that modeled postnatal development and recapitulated this non-CpG methylation, we might be able to detect cellular and functional phenotypes in later stage D-INs. To avoid misinterpretation, we will alter the language in the results section to confirm that there are both transcriptomic and methylation changes in our D-NPCs/D-INs, but that these are not accompanied by cellular phenotypes or neuronal dysfunction.

      Text edits: We will better clarify that there are transcriptomic and methylation changes in D-NPCs/D-INs, but that these changes are minimal compared to those in V-NPCs/V-INs, as supported by the lack of cellular and functional phenotypes seen in D-NPCs/D-INs.

      (5) SST stainings are not entirely convincing. They appear mostly nuclear, and some instances localized to rosettes in organoids, whereas the protein is largely confined to processes and is expected to be found outside progenitor-rich zones like rosettes.

      We agree that the perinuclear SST staining detected in these young ventral telencephalic-patterned organoids at day 30 differs somewhat from the more process-localized and cytosolic signal seen in later stage organoids in other studies. This may be related to the use of different commercial SST antibodies across studies but also likely reflects SST immunoreactivity in newborn neurons near the onset of SST expression. For example, immature SST-immunoreactive neurons in the early postnatal rat cortex exhibit predominant SST staining in perinuclear cytoplasm and short processes (e.g. Fig. 3 in Lee et al, PMID: 9664223) while acquiring more cytosolic and process-localized staining as postnatal neuron maturation occurs. Evaluation of immunopositivity for other markers of neurogenesis (ASCL1) and immature neurons (TUJ1) is also congruent with these findings for SST, with TBRS-associated mutations increasing the fraction of cells in V-NPCs/V-ORGs that express these three markers.

      Reviewer #3 (Public review):

      Summary:

      In this manuscript, the authors investigated TBRS etiology by using new human pluripotent stem cell models, modeling varying levels of TBRS-associated loss of DNMT3A function. They identified increased lineage-specific proliferation of precursors in TBRS ventral MGE-like progenitors, which they propose was related to increased signaling through the PI3K/AKT/mTOR pathway. Furthermore, they show that reduced DNA methylation during MGE-like progenitor differentiation into GABAergic interneurons can cause a premature expression of neuronal and synaptic genes, triggering precocious neuronal maturation. In conclusion, they propose that TBRS-derived GABAergic neurons exhibit hyperactivity that can alter the development and structure of neuronal networks.

      Strengths:

      Overall, the data presented is convincing, from an early developmental point of view, given that the iPSC-derived 2D cultures or organoids used do not get to reach a mature state. Nonetheless, the data clearly show the effects that deleterious mutations in TBRS can cause during the period of neurogenesis, which was missing in the field.

      Weaknesses:

      (1) Li et al., 2022 (referred to in the manuscript) seems to already show the interplay between H3K27me3 and Dnmt3a discussed in this study i.e., that in the absence of DNA methylation, there is an expansion of polycomb-like repression. These data should be better acknowledged in the paragraph 'Repressive H3K27me3 compensates for severe loss of DNA methylation' (page 9), given it supports the data presented in this manuscript and suggests this as a common mechanism in the interplay between these two repressive marks, as it is well established in the literature.

      We thank the reviewer for this suggestion and will incorporate this reference into both the results and the discussion when discussing the respective roles of DNMT3A and PRC2-mediated repression.

      Text edits: We will add Li et al., 2022 to both the results section (pg. 9-10) and our discussion section.

      (2) The authors should acknowledge that the omics data come from a mixed population of cells.

      We thank the reviewer for their comment. We have validated that the established 2-D differentiation methods we used in this study generate cell populations with >85-90% enrichment for the desired progenitor and neuronal cell type, based upon marker expression, but acknowledge that these are bulk -omics data obtained from cells that may represent a mixed population and have now detailed this in the methods section under “Sequencing”.

      Text edits: we will add language acknowledging that our omics data (bulk) was generated from mixed populations of cells.

      (3) The authors are encouraged to further discuss whether the overgrowth observed in ventral GABAergic cultures or organoids compares to the overgrowth observed in diseased patients. One expects MRIs to have been performed in patients and that these could be harnessed to discern if overgrowth occurs in the cortex or ventral regions of the brain.

      We thank the reviewer for their suggestion and do note that at least one published study documents increased cortical thickness in the MRIs of TBRS patients (Jiménez de la Peña et al., 2024, PMID: 37795572); however, to our knowledge studies have not examined regional or cell type-selective overgrowth of cortical tissue in TBRS patients. Future clinical studies examining the nature of the neuronal progenitor overgrowth and resulting consequences for patient brain imaging would be of interest to better understand TBRS-associated etiology of brain overgrowth and its manifestations.

    1. eLife Assessment

      This is a useful study investigating the role of peristalsis in the elongation of the gut, using the chick ceca as a model. The work employs optogenetics together with embryological approaches to establish links between peristaltic muscle contractions and downstream cell behaviors that lead to tube elongation. However, the work is somewhat incomplete, limited in mechanistic insights that would extend beyond prior work in the literature, which has already suggested a role for smooth muscle contractility in avian gut elongation.

    2. Reviewer #1 (Public review):

      Kawamura et al. investigated the role of circumferential smooth muscle contractions in chick gut tube elongation, addressing the hypothesis that "peristaltic activity generated by the gut promotes its own elongation during embryogenesis". Although not acknowledged in the current manuscript, this interesting premise was, in fact, previously demonstrated.

      Indeed, the experiments in the present manuscript closely parallel a previous study (Khalipina et al, 2019: "Smooth muscle contractility causes the gut to grow anisotropically") that also cultured chick gut tissue and performed time-lapse analyses to quantify peristalsis. Both studies showed that inhibiting peristalsis with Ca-channel blockers induces a switch from elongational to radial growth in the gut.

      However, one of the main strengths of the current study is the innovative use of optogenetic manipulation to rescue gut lengthening in drug-inhibited gut tissue by re-stimulating peristaltic contractions. In addition, the authors use aphidicolin to show that peristalsis-mediated gut elongation is independent of cell division. They also track individual smooth muscle cells and show that they divide circumferentially, but become redistributed along the length of the gut tube with peristalsis.

      While these data are solidly quantitative, they do not provide mechanistic insight into how peristaltic contractions cause smooth muscle cells to be redistributed.

      The evidence presented in this manuscript supports the main conclusion that peristalsis plays a critical role in embryonic gut elongation, but this conclusion itself is not novel. In addition to corroborating previous work, this manuscript provides some useful additions to our existing knowledge of the role of mechanical forces in embryonic gut morphogenesis and illustrates the utility of a previously published optogenetic manipulation technique.

    3. Reviewer #2 (Public review):

      Summary:

      This study uses the chicken caecum ex vivo culture to show that embryonic peristaltic activity is a key mechanical factor for gut elongation. It is shown that pharmacological inhibition arrests intestinal growth, while optogenetic restoration rescues longitudinal elongation. The authors propose a two-step mechanism in which circular smooth muscle cells proliferate circumferentially, but peristalsis pushes them toward longitudinal rearrangement, which explains the anisotropic growth of the gut.

      Strengths:

      The experiments combine loss-of-function (peristalsis inhibition) with gain-of-function (optogenetic rescue) experiments and quantifiable readouts in an embryonic gut culture model. The work is clearly presented with nice microscopy videos and offers a potentially valuable conceptual framework linking tissue-scale mechanics to smooth muscle cell behaviors during development.

      Weaknesses:

      Some results appear conceptually inconsistent with the claim of peristalsis-essential rearrangement (e.g., longitudinal separation of daughter cells even without peristalsis), and the mechanistic link would benefit from clearer quantification and reconciliation. The study largely overlooks contributions from other gut layers and the ECM (and aphidicolin affects all proliferating cells), limiting interpretation of how smooth muscle rearrangement translates into whole-wall elongation.

    4. Reviewer #3 (Public review):

      Summary:

      The authors noted a steep increase in the rate of growth with the onset of more frequent peristaltic-like movements and hypothesized that peristaltic activity rearranges the orientation of cell growth from circumferential to longitudinal. This study sought to alter peristalsis and then (1) carefully examine the growth of the chick cecum relative to the frequency of peristaltic-like movements and (2) examine the orientation of cells relative to the circumferential and longitudinal axes to determine whether peristalsis is required for cecum lengthening. To alter peristaltic-like movements, contraction was inhibited through treatment with nifedipine (a calcium channel blocker that acts to relax smooth muscle) or Ani9 (inhibits Ca-activated chloride channels), and contractions were induced through activation of a blue light-activatable channelrhodopsin-2 (introduced through electroporation).

      Strengths:

      (1) Use of multiple methods to alter peristalsis in initial studies.

      (2) Live imaging.

      (3) Careful measurements.

      (4) Nicely presented figures.

      Weaknesses:

      (1) Only Nifedipine inhibition was examined for cell positional changes.

      (2) Ki67 was not carefully analysed, and apoptosis was not shown at all.

      (3) The results shown are suggestive of a role for peristalsis in the lengthening of the cecum. Demonstration that increased peristalsis could further increase lengthening would be helpful.

      (4) The novelty of this work is incremental for the field in that the reagents used and the model of smooth muscle driving gut lengthening in mouse and chick small intestines have both previously been published. This manuscript does suggest that the role of smooth muscle in longitudinal growth may extend to other tubular organs (chick cecum).

    5. Author response:

      We sincerely appreciate the efforts of the Senior and Reviewing Editors, as well as the three reviewers, for their careful evaluation of our manuscript and their insightful comments. Previous studies have suggested that smooth muscle activity contributes to gut elongation; however, these studies do not directly demonstrate that peristaltic movements per se drive elongation. For example, studies in mice have primarily focused on residual stress of smooth muscle (Yang et al., 2021), rather than the dynamic spatiotemporal nature of peristalsis. In chickens, inhibition of peristalsis by nifedipine has been interpreted as evidence for a role of peristalsis in gut elongation (Khalipina et al., 2019). However, because nifedipine broadly affects calcium-dependent cellular processes, these experiments cannot distinguish whether the observed effects arise specifically from loss of peristalsis or from other cellular perturbations. In our current study, we aimed to address this limitation by combining pharmacological inhibition with optogenetic reactivation. This approach allows us to selectively restore peristaltic movements under conditions in which endogenous peristalsis is suppressed. Based on these experiments, we provide evidence supporting a causal contribution of peristalsis to anisotropic gut growth. We agree with the reviewers that the positioning of our study relative to previous work should be clarified. In the revised manuscript, we will more clearly distinguish between static mechanical tension and endogenous peristaltic movements, and better define the conceptual advance of our study. In addition to macroscopic growth analysis, we identified cellular dynamics associated with elongation, including circumferentially oriented cell division and peristalsis-dependent longitudinal cell rearrangement. We agree that the mechanistic link between peristalsis and downstream cellular behaviors remains incompletely understood. In the revised manuscript, we will clarify this limitation and outline future directions, including experiments to test the role of mechanical cues (e.g., mechanical perturbation and pharmacological manipulation of mechanotransduction pathways).

      Public Reviews:

      Reviewer #1 (Public review):

      The mechanism by which peristalsis and the cell rearrangement are mediated

      We appreciate this important point. As suggested, the possibility that mechanical aspects of peristalsis contribute to gut elongation is highly plausible. To address this, we plan to perform additional experiments aimed at isolating the mechanical component of peristalsis. Furthermore, we will investigate the involvement of mechanotransduction pathways, including the Piezo-mediated pathway, using pharmacological approaches. We will revise the manuscript to better discuss these possibilities and clarify the current limitations of our study.

      The novelty and positioning of our study

      We appreciate this comment and have addressed this point in the General response above. In the revised manuscript, we will more clearly position our study relative to the previous studies.

      Reviewer #2 (Public review):

      Longitudinal separation of daughter cells even without peristalsis

      We appreciate this insightful and important comment. As noted, daughter cells can exhibit longitudinal separation even under nifedipine treatment, whereas the divergence index (DI) shows a clear increase only in the control (with peristalsis) condition. We interpret this as follows: immediately after cell division, the two daughter cells occupy nearly identical positions along the longitudinal axis, and stochastic fluctuations may cause them to separate from each other. Such local separation does not necessarily reflect population-level cell rearrangement. In contrast, DI captures collective dispersion of a cell population, which reflects organized tissue-level rearrangement associated with elongation. We will revise the manuscript to clarify this distinction between local cell behavior and population-level dynamics, and to better explain how DI reflects elongation-related processes.

      Contributions from other gut layers and ECMs

      We agree that contributions from other tissue layers and extracellular matrix (ECM) components might be important. To address this, we plan additional experiments including targeted ablation of specific tissue layers and pharmacological manipulation of ECM remodeling (e.g., using MMP modulators). We will also expand the Discussion to better acknowledge these factors.

      Reviewer #3 (Public review):

      (1) We agree that experiments based solely on nifedipine treatment cannot fully exclude potential off-target effects. To address this limitation, we plan to perform additional experiments that rescue the mis-rearrangement of cells by applying mechanical forces.

      (2) We agree that more elaborate analyses of cell proliferation and apoptosis are needed. In the revised manuscript, we will incorporate additional analyses using appropriate markers and methods suitable for developing gut tissue.

      (3) In Figure 2, we had already tested an increased frequency of peristaltic contractions (30 s intervals; Fig. 2i, j, k, n). This did not result in a significant increase in elongation or widening compared to the control condition (120 s intervals). This suggests that the effect of peristalsis on elongation may reach a plateau at a certain frequency. We will revise the manuscript to clarify this interpretation and discuss its implications.

      (4) We appreciate this important comment and have addressed the issue of novelty and positioning in the General response shown above.

      References

      Yang, Y. et al. Ciliary Hedgehog signaling patterns the digestive system to generate mechanical forces driving elongation. Nat. Commun. 12, 7186 (2021).

      Khalipina, D., Kaga, Y., Dacher, N. & Chevalier, N. R. Smooth muscle contractility causes the gut to grow anisotropically. J. R. Soc. Interface 16, 20190484 (2019).

    1. eLife Assessment

      This valuable study builds a novel auditory-motor paradigm to investigate how the brain learns associations between movements and their auditory consequences. Solid evidence is provided for early ERPs (50-100 ms latency) reflecting violations of established key-pitch mappings. The writing, however, could be streamlined to better emphasize the paper's key contribution, and some statistical analyses might be improved.

    2. Reviewer #1 (Public review):

      Summary:

      Zhang et al. report on an ambitious study that investigates multiple aspects of the neural and behavioral underpinnings of auditory-motor surprisal in the context of an auditory-motor learning paradigm (piano keyboard). Using an intricate design comprising several sub-parts and control procedures, they report that early ERPs (50-100 ms latency) reflect violations of established key-pitch mappings.

      Strengths:

      This is a carefully devised and executed study. The paradigm is quite intricate and, at the same time, addresses multiple aspects of auditory-motor learning, and does so in a rigorous way.

      Weaknesses:

      Perhaps because of the exhaustive approach, it is sometimes difficult to follow which parts of the experimental design the results come from; there are some questions regarding appropriate statistical methods, the inclusion/treatment of musical background in participants, and the nature (latency & extent) of the identified neural components that detect auditory-motor violations.

    3. Reviewer #2 (Public review):

      Summary:

      Zhang et al. report an EEG study (n=18) of participants playing a keyboard where the correspondence between keys and pitches is varied to introduce sensory-motor mismatches (discrepancies between sensory inputs and expected sensory consequences of motor commands). They find that the auditory N100 amplitude is enhanced for the initial keystroke following a mapping switch but rapidly attenuates for subsequent keystrokes (showing rapid updating of the forward model), whereas the motor-related P50 amplitude only differentiates trained versus untrained mappings after 30 minutes of goal-directed practice (potentially showing timescales of inverse model updating). Using parallel univariate and mTRF decoding analyses, they conclude that forward models (mapping action to predicted sound) update almost instantly to track short-term context, while inverse models (mapping sound to motor commands) update slowly and require extended, targeted practice.


      Strengths:

      (1) Methodological innovation: The study utilizes an interesting, continuous auditory-motor paradigm that moves beyond standard trial-by-trial oddball designs, offering a more ecologically valid measure of trial-to-trial adaptation.

      (2) Analytical elegance and rigor: The combination of traditional univariate ERP analyses with multivariate temporal response function (mTRF) decoding is elegant, allowing the authors to successfully dissociate overlapping auditory and motor variance streams.

      (3) The dissociation between the rapid adaptation of the N100 forward model and the slower adaptation of the P50 inverse model is interesting.

      Weaknesses:

      (1) Confounded passive listening baseline: The passive listening control condition lacks an orthogonal behavioural task (e.g., an occasional oddball detection task). Active playing inherently necessitates focused attention on auditory feedback to monitor performance, whereas passive playback does not. The globally weaker stimulus-evoked pattern at electrode Fz during passive listening strongly suggests that the absence of an N100 effect in this condition may simply reflect a lower state of attention, rather than isolating the absence of a motor-driven forward prediction. This matters in particular because pure sensory surprisal was also enhanced for "first" notes, which could likewise lead to a stronger N1, but this effect may be masked.

      (2) Overclaimed theoretical novelty: The conceptual framing leans excessively on the authors' specific "MirrorNet" framework, presenting foundational, decades-old tenets of the motor control literature (i.e., unsupervised exploration for forward models vs. supervised skill acquisition for inverse models; Wolpert, Jordan, both in the nineties) as their own novel "conjectures." This theory-heavy introduction obscures the paper's actual empirical contribution to the design and the interesting question regarding the distinct temporal adaptation scales of forward versus inverse models. I think some rewriting can improve the paper.

      (3) Misplaced surprisal terminology: In a similar vein, I find the use of the term "auditory-motor surprisal" more theoretical grandstanding than actually useful. The significance statement claims to "extend this principle from sensory processing" but in fact, the concept of sensory motor unexpectedness is again a staple of the forward motor literature. Moreover, nowhere in the paper do they actually estimate sensorimotor surprisal. While the authors compute surprisal for their auditory baseline using IDyOM, their central sensorimotor analysis relies entirely on a simple categorical mismatch (first vs. subsequent keystrokes). The phenomenon can equally be referred to by its established nomenclature: "sensorimotor mismatch" or "sensory motor unexpectedness".

      (4) Incremental conceptual advance regarding the N100: The paper frames the N100 finding as a major discovery, but as far as I know, the attenuation of the auditory N1 to self-generated sounds via accurate motor prediction, and its enhancement during sensorimotor mismatch, is one of the most heavily documented phenomena in the auditory-motor literature (e.g. Timm et al., 2013; Bendixen et al, 2012; 2013). As far as I'm concerned, the authors should clarify that the novelty lies in the novel, elegant design that provides a new way to correct for non-sensory-specific motor-induced attenuation, and in characterizing the distinct adaptation timescales of forward versus inverse models, not in demonstrating N100 modulation by sensorimotor mismatch, which is well-documented.

    1. eLife Assessment

      This useful study asked whether the behaviour of motor units from a hand muscle changed across the two mechanical actions it performs. The authors used high-density intramuscular electrodes to record the activity of several motor units and reported changes in motor unit recruitment order across tasks that were not dependent on motor unit properties, suggesting differential spinal contributions to the two actions. However, the evidence supporting their main claims is incomplete, and some of the conclusions are based on unsubstantiated assumptions: the authors should correct several key analyses and temper claims that are not directly backed up by their data.

    2. Reviewer #1 (Public review):

      Summary:

      Osswald and colleagues aim to show how motor units of the first dorsal interosseous (FDI) are flexibly recruited across two functionally different movements: index finger abduction and index finger flexion. They motivate this by arguing that FDI is the prime mover in abduction but acts as a synergist in flexion, alongside flexor digitorum profundus (FDP) and flexor digitorum superficialis (FDS) as the prime movers. This is a worthwhile question because it speaks to how descending neural inputs to the spinal cord flexibly control movement.

      The authors claim that recruitment order and recruitment threshold of FDI motor units differ between abduction and flexion, and that beta-band intramuscular coherence is reduced when FDI acts as a synergist. However, there are significant methodological concerns that undermine the results and conclusions.

      Strengths:

      The study certainly aims to address a central question in motor neuroscience - how flexible recruitment of motor units occurs across movements where the same muscle changes its functional role. They correctly identify the FDI as a multi-functional muscle and use intramuscular high-density EMG arrays to record several motor units simultaneously, which is a major technical strength. They also track individual motor units between conditions and, therefore, have generated a potentially valuable dataset for studying spinal motor control across different movements.

      Weaknesses:

      The key limitation comes from the authors' interpretation of "neural drive" to FDI. The authors acknowledge that global EMG during flexion is smaller than that during abduction (for the same force), and surmise that the FDI receives different amounts of neural drive between these two movements, which is a potential confound for their analyses. To match the neural drive (i.e., global EMG), the authors ask participants to generate the same global EMG in flexion as in abduction; the forces generated by FDI are significantly different (2-3 N for abduction and 1.8-6.2 N for flexion). From this, they find changes in recruitment order, recruitment threshold, and beta coherence. However, different FDI motor units (and different muscle fibres) are active during abduction versus flexion. Using global EMG as a proxy for neural drive ignores this spatial separation of EMG generation during abduction and flexion, such that some amount of global EMG generated by one part of FDI (during abduction) is considered the same (from a neural drive perspective) as the same amount of EMG generated by a completely different part of FDI (during flexion). But these two global EMGs (during abduction and flexion) are not biologically equivalent because they are generated by different motor units and muscle fibres. Consequently, neural drive during flexion and abduction is not equivalent, which makes biological interpretation less clear. Furthermore, it is difficult to tell if abduction-versus-flexion differences are due to task role (prime mover vs synergist) or differences in force/mechanical demands, multi-muscle coordination, and spatial sampling limits of intramuscular recordings.

      As mentioned, we think that the question asked is a very interesting one and framed appropriately to investigate the behaviour of motor units during prime mover and synergist roles. Simultaneously recording the prime movers for index flexion (FDP and FDS) would significantly improve the completeness of the study and allow for multi-muscle comparisons that are more relevant to how the motor system resolves prime mover vs synergist roles.

      The authors use motor unit action potential as a proxy for motor unit size. This is not suitable because muscle fibres closer to the electrode will appear larger, independent of their true size. We advise that the authors remove analyses pertaining to motor unit size if it cannot be accurately measured.

      Finally, several mechanistic interpretations in the discussion (e.g., spinal interneuronal suppression, reduced corticospinal input, proprioceptive mechanisms) read as more speculative than the current data can support without added controls or citations.

    3. Reviewer #2 (Public review):

      In this study, the authors examine whether the structure of motor unit (MU) recruitment and firing varies across movement directions in the human first dorsal interosseous (FDI) muscle. While task-dependent changes in MU recruitment have been reported previously (e.g., Thomas et al. 1986), these findings were largely based on recordings from a limited number of isolated single motor units. By applying high-density intramuscular electromyography and decomposition techniques, the authors demonstrate similar phenomena at the level of larger MU populations, thereby providing a useful consolidation of prior observations. In addition, they show that recruitment thresholds shift across tasks while the inverse relationship between discharge rate and recruitment threshold (the "onion-skin" organization) is preserved, suggesting that the overall structure of inputs to the motoneuron pool remains stable despite changes in recruitment order. Furthermore, by analyzing intramuscular coherence across MU firing, the authors attempt to characterize differences in the extent of synchronization among frequency components of neural inputs between abduction and flexion of the index finger. In particular, they report reduced beta-band coherence during flexion compared to abduction, indicating decreased synchronization in this frequency range (13-30 Hz). This observation is noteworthy, as it points to potential differences in the neural inputs underlying these task-dependent changes.

      A key strength of the study is that it extends prior work on task-dependent MU recruitment to larger populations using state-of-the-art recording and decomposition approaches. This represents a meaningful technical and conceptual advance over earlier studies limited to small numbers of units. The finding that recruitment shifts between flexion and abduction occur consistently across MUs, independent of motor unit size, further strengthens the robustness and generality of the observed phenomenon. Together, these results provide convincing evidence that MU recruitment is not strictly fixed by a rigid size principle across functional contexts and thus make a valuable contribution to the literature on motor control.

      However, several aspects of the mechanistic interpretation are less well supported. The authors interpret their findings as reflecting a "redistribution" of net excitatory input to the motoneuron pool across tasks. While this is a plausible interpretation of the observed changes in recruitment thresholds and recruitment order, it is not directly demonstrated by the analyses presented. The current data do not clearly distinguish redistribution of inputs from alternative explanations, such as task-dependent modulation of shared versus independent inputs, or changes in the effective gain of existing pathways. As such, the evidence for a specific redistribution of input remains incomplete.

      The interpretation of the intramuscular coherence analysis represents a further key weakness. By computing frequency-specific coherence across MUs during abduction (as a prime mover) and flexion (as a synergist), the authors report reduced beta-band coherence during flexion and interpret this as evidence for attenuated corticospinal input and increased involvement of spinal circuits. However, the relationship between changes in downstream coherence and the magnitude of upstream neural drive is not well established. Coherence reflects the synchronization of inputs rather than their net strength, and therefore, a reduction in coherence cannot be directly interpreted as a decrease in input from a specific source. Moreover, coherence measures alone do not permit identification of the origin of the inputs, and thus do not provide sufficient evidence to attribute the observed differences to descending or spinal pathways. While the difference between tasks is clear and potentially informative, the mechanistic interpretation appears overstated and should be treated more cautiously.
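      The point that coherence tracks synchronization rather than net strength can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the signals are synthetic Gaussian noise rather than motor unit spike trains, the function names (`dft`, `coherence`) are hypothetical, and the estimator is a bare, unwindowed segment average rather than the method used in the manuscript. The sketch shows only the normalisation property: multiplying one signal by a constant gain leaves magnitude-squared coherence unchanged, whereas removing the shared component lowers it.

```python
import cmath
import math
import random


def dft(segment):
    """Naive discrete Fourier transform, positive-frequency bins only."""
    n = len(segment)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(segment))
        for k in range(n // 2 + 1)
    ]


def coherence(x, y, seg_len=64):
    """Magnitude-squared coherence averaged over non-overlapping segments."""
    n_bins = seg_len // 2 + 1
    pxx = [0.0] * n_bins  # auto-spectrum of x
    pyy = [0.0] * n_bins  # auto-spectrum of y
    pxy = [0j] * n_bins   # cross-spectrum of x and y
    for start in range(0, len(x) - seg_len + 1, seg_len):
        fx = dft(x[start:start + seg_len])
        fy = dft(y[start:start + seg_len])
        for k in range(n_bins):
            pxx[k] += abs(fx[k]) ** 2
            pyy[k] += abs(fy[k]) ** 2
            pxy[k] += fx[k] * fy[k].conjugate()
    return [abs(pxy[k]) ** 2 / (pxx[k] * pyy[k]) for k in range(n_bins)]


rng = random.Random(0)
n_samples = 2048

# Hypothetical signals: a shared component plus independent noise in each channel.
common = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
x = [c + rng.gauss(0.0, 0.5) for c in common]
y = [c + rng.gauss(0.0, 0.5) for c in common]

# Same signal with a threefold gain: larger amplitude, identical timing.
y_scaled = [3.0 * v for v in y]

# A signal with no shared component at all.
y_indep = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]

coh_shared = coherence(x, y)
coh_scaled = coherence(x, y_scaled)
coh_indep = coherence(x, y_indep)


def mean(values):
    return sum(values) / len(values)


print("mean coherence, shared input:   ", round(mean(coh_shared), 3))
print("mean coherence, scaled by gain: ", round(mean(coh_scaled), 3))
print("mean coherence, independent:    ", round(mean(coh_indep), 3))
```

      In this toy setting the gain change is invisible to coherence, which is consistent with the caution that a drop in coherence cannot by itself be read as weaker input from a particular source. The sketch does not address nonlinear effects in real motoneurons, where input strength and synchronization may interact.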

      A related issue concerns the interpretation of the preserved RT-DR relationship. While this finding supports the presence of a stable common input structure across tasks, the additional claim that proprioceptive feedback contributes significantly to maintaining this organization is not clearly justified by the presented data. No direct evidence is provided to dissociate afferent from descending inputs, and the absence of task-dependent differences in lower-frequency coherence further limits support for this interpretation. As such, the proposed role of proprioceptive feedback appears speculative.

      Overall, the authors successfully achieve their primary aim of demonstrating task-dependent flexibility in MU recruitment at the population level, and the results provide useful empirical support for this phenomenon using modern techniques. The study is likely to be of interest to researchers in motor control and neuromuscular physiology, particularly given the increasing relevance of MU-level analyses in both basic and applied contexts. However, the broader mechanistic conclusions regarding the nature and origin of the underlying neural inputs are not fully supported by the data and would benefit from more cautious interpretation or additional experimental evidence.

    1. eLife Assessment

      The authors make the valuable observation that directional memory during epithelial cell migration is enhanced compared to single-cell migration. They attribute this effect to adherens junctions and vinculin dimerization. In the work, central measures should be defined more precisely, and the support for their claims about the roles of adherens junctions and vinculin dimerization in memory enhancement remains incomplete.

      [Editors' note: this paper was previously reviewed by another journal.]

    2. Reviewer #1 (Public review):

      Summary:

      In this work, the authors study the migration of isolated cells and of cells in ensembles. They quantify several aspects of the corresponding migration patterns and investigate how these quantities depend on molecules that are known to play an important role in migration. Furthermore, they study the effect of external cues on these migration processes.

      Strengths:

      The authors provide a clean and uniform setting for comparing the migration of isolated cells and of cells in an ensemble in control and mutant conditions, and in the presence and absence of external cues. This allows for a meaningful comparison between different conditions. In this way, the authors obtain useful data that link the migration of isolated cells to that of cells in collectives.

      Weaknesses:

      A major weakness of the manuscript is that the authors do not properly introduce the quantities and concepts they are working with. In this way, it is hardly accessible for a reader who does not have a thorough background in cell migration and anomalous transport. In addition, the manuscript uses some notions that are not standard, for example, vinculin or FA stability, which are not properly introduced. Most strikingly, "collective directional memory" is not defined.

      The authors infer relationships between different quantities, but they remain qualitative, even though the authors use a language that suggests otherwise. For example, "The combination of Focal Adhesion stability and force transmission from the cytoskeleton predicts the migration speed of single cells" (p 2). I am not sure what is meant by prediction, but this heading suggests that knowledge of FA stability and force transmission yields the migration speed. Reading this line, I expect that if I give you values for FA stability and force transmission, you would give me a value for the migration speed. Such a quantitative mapping is not provided. In fact, it cannot be provided, because - as mentioned before - these quantities are not properly defined, so I would not know how to measure them. I do not even know their units.

      Furthermore, the authors do interpret some of their results without explaining or justifying the basis for their interpretation. For example, they use the FRET index of vinculin - another notion that is not properly introduced - to make statements about mechanical stress.

      It also seems that the figures could be improved. Some of the sketches are, in my opinion, not helpful. Examples are Figure 3A (how could a cell move while the hexagonal arrangement of the cells is maintained?) or Figures 2F, 4F, and 6F (what do the colored ellipses indicate?). In Figures 1B, 1D, 2A, 2E, 3B, 3D-F, 4A, 4F, 5B-D, it is not clear which lines merely connect data points and which lines are fits to the data.

    3. Reviewer #2 (Public review):

      Summary:

      The manuscript by Canever et al was assessed by three Referees at another journal, who brought up a range of critical points. I will not repeat a summary of the work; this can be found in the first-round reviews.

      Strengths:

      In their revised manuscript, the authors include substantial changes and additional reasoning. Along with their rebuttal letter, I think they make a very convincing case. While the claims are well supported by the analysis, I do not see that the findings need to be universal to be relevant. It might be rather surprising to me if there existed such a universality, in fact. I think that the findings are solid and interesting in their own right and are worthy of publication, especially with the amended discussion in this revision.

      Weaknesses:

      However, while the more bio-oriented parts are not fully accessible to me, I do have a few points from the data analysis point of view that need amendment.

      (1) The mathematical models used need to be specified more precisely. First, the authors confuse Levy flights and walks. These are distinct processes in the sense that a Levy flight does not have a finite variance and thus no finite speed. The proper model here would be Levy walks. As in a large body of the literature, both notions are used interchangeably here, although they are distinct processes. Then the authors speak about a "superdiffusive model", for which I do not find a proper definition. There exists an entire range of superdiffusive models, each with a different physical background, so this needs more clarity. The authors may consult one of the standard reviews for more details, e.g., Soft Matter 8, 9043 (2012) or Phys Chem Chem Phys 16, 24128 (2014). Overall, a few equations (maybe in the Supplement) would help to be more specific.

      (2) For fractional Brownian motion, the authors should check the displacement correlation function; it should show slowly decaying, positive correlations. More details on the practical analysis of FBM can be found, e.g., in Phys Chem Chem Phys 27, 14350 (2025). These correlations should decay as a function of the bin time, e.g., as discussed for the opposite case of subdiffusion in Phys Rev E 88, 010101(R) (2013) [cf Fig 3b]. In general, FBM was determined to be a highly relevant process for a number of systems, including amoeba cells at shorter times, see the detailed analysis in Phys Rev Res 4, 033055 (2022). That paper also details different ways to characterise the motion in terms of scaling exponents.

      (3) Some relevant approaches discussed in literature that should be discussed in the context of this work: eLife 9, e52224 (2020); Rep Prog Phys 86, 126601 (2023); Chaos 35, 023145 (2025). In the context of non-Gaussianity for active particles: Phys Rev E 104, 064615 (2021); New J Phys 25, 013010 (2023).

      (4) In the abstract, I am having some issues with the formulation in the sentence: "This directional memory emerges from fractional Brownian motion". It sounds as if FBM were a fully clarified phenomenon. I would prefer some statement along the lines that the data are consistent with such a mathematical modelling approach.

      After fixing these points, I think the manuscript will clearly warrant being shared.
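
      As an illustration of the displacement-correlation check suggested in point (2), the following is a minimal sketch, not the authors' analysis. It assumes 1D positions sampled at a fixed frame interval; the function names and the binning scheme are hypothetical choices made for this example. The first function gives the standard normalized autocorrelation of FBM increments at integer lag k for Hurst exponent H, which is positive and slowly decaying for H > 1/2 and negative for H < 1/2. The second estimates the same quantity from a trajectory.

```python
def fgn_autocorrelation(lag, hurst):
    """Theoretical normalized autocorrelation of FBM increments at an
    integer lag (in units of the bin time), for Hurst exponent `hurst`."""
    k = abs(lag)
    h2 = 2.0 * hurst
    return 0.5 * ((k + 1) ** h2 - 2.0 * k ** h2 + abs(k - 1) ** h2)


def displacement_autocorrelation(positions, bin_steps, lag):
    """Empirical normalized autocorrelation of 1D displacements.

    Displacements are taken over non-overlapping bins of `bin_steps`
    frames and correlated at a separation of `lag` bins. The mean is not
    subtracted, i.e. a zero-drift process is assumed (an assumption of
    this sketch).
    """
    coarse = positions[::bin_steps]
    steps = [b - a for a, b in zip(coarse, coarse[1:])]
    n = len(steps) - lag
    if n <= 0:
        raise ValueError("trajectory too short for this bin size and lag")
    numerator = sum(steps[i] * steps[i + lag] for i in range(n)) / n
    denominator = sum(s * s for s in steps) / len(steps)
    return numerator / denominator
```

      Comparing the empirical estimate with the theoretical curve for several bin sizes is one way to test whether a measured exponent is consistent with FBM rather than with another superdiffusive process.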

    4. Reviewer #3 (Public review):

      This manuscript focuses on the presence/origin of directional memory during epithelial cell migration. It starts by analyzing single cells and then moves to more complex systems (confluent layers and scratch assays). The paper first demonstrates that the migration in all of these systems is well-described by persistent random walks, which likely emerge from fractional Brownian motion. This is an important demonstration, as it implies orientation memory in the systems. Then the paper proceeds to attempt to discern the origin of this memory and claims to establish key roles for adherens junctions and vinculin dimerization. While for the most part the manuscript is well-written, there are some significant overinterpretations in experimental results. The largest issue is demonstrating the role of vinculin dimerization, which is not a well-studied phenomenon inside living cells, as all data is reliant on a single point mutation (Y1065E). Additionally, the authors seem to be over-interpreting several of the assays; the statistical analysis does not seem to encompass all comparisons made, and the molecular model proposed does not clearly explain the observed results. The discussion could also be strengthened by considering other aspects of vinculin behavior (e.g., vinculin catch bonding) as well as discussing some other recent similar papers.

      (1) Likely the most significant issue with the manuscript is the interpretation of the vinculin Y1065E variant and the assumption that the only defect the mutation causes is a lack of dimerization. Vinculin dimerization is mediated by a conformational change in the vinculin tail domain induced by F-actin binding (Thompson, FEBS Letters, 2013). Dimerization of the vinculin tail domain has been clearly demonstrated in in vitro systems involving purified proteins, as the authors point out in the manuscript. However, the dimerization of full-length vinculin has not been well characterised in living cells. There are several reasons to suspect dimerization is potentially not prevalent in cells. For instance, in the presence of other actin-binding proteins, there may not be sufficient binding sites available on neighboring actin filaments to facilitate dimerization. Additionally, pY1065 vinculin and vinculin Y1065E have been associated with increased vinculin activation (Huang, JBC, 2014), so other effects seem possible. While the Y1065E variant clearly has an effect on the tension sensor readout and vinculin dynamics, further experimental evidence is needed to show that these effects are due to a lack of dimerization in living cells. To justify the definitive claims made in the manuscript, the authors likely need to develop, or employ, an assay for detecting vinculin dimerization in living cells. The authors could choose between intermolecular FRET, proximity labeling assays (i.e., antibodies with DNA for signal amplification), bimolecular fluorescent complementation (i.e., split GFP) based approaches, or some other approach. It should be noted that working with full-length vinculin, not just Vt, and designing an assay that can incorporate vinculin Y1065 variants (Y1065E and potentially Y1065A/F) would strengthen results. Also, the authors should be aware that the observation of strong dimerization may invalidate the use of FRET-based tension sensors in this system or at least necessitate intermolecular FRET control experiments.

      (2) The authors have seemed to assume that FRAP and adhesion stability are interchangeable. To this reviewer's knowledge, this is not the standard in the field. FRAP informs about molecular dynamics. Stability assays, which probe the spatial position of an entire focal adhesion over time (Zaidel-Bar, JCS, 2007, although other approaches are equally suitable), are typically used for assessing adhesion stability. If the authors wish to make strong claims about the stability of the adhesions, non-FRAP-based assays should be employed. Alternatively, the authors could interpret the FRAP data simply in terms of vinculin dynamics.

      (3) A major conclusion in the manuscript is that in response to overexpression of a specific vinculin construct, focal adhesions behave the same in single cells, confluent cells, and collectively migrating cells for all the mutants but Y1065E. However, outside of the FRET measurements, there is not much evidence to support this claim. The authors should perform a greater comparison of the focal adhesions between the systems used in the manuscript (single cell, confluent cells, collectively migrating cells). Key measurements would include focal adhesion number per cell, focal adhesion size, focal adhesion orientation, vinculin dynamics (e.g., FRAP), focal adhesion stability, and some indicators of focal adhesion composition. For the last aspect, focusing on focal adhesion components that also have roles in adherens junctions, such as VASP, seems appropriate. Without such characterization, it is an overinterpretation to assume that focal adhesions are the same in each system and, therefore, effects are due to vinculin behavior in the adherens junctions.

      (4) What is shown in Figure 3G is not clear. How are P/Po and alpha shown on different areas of the same plot?

      (5) It seems that an insufficient statistical test was used in many experiments. There are comparisons being made between systems (cell migration speed, FRET index...) that are not directly compared in a statistical test. Statistical tests are limited to differences from control (over-expression of full-length vinculin), and consistent increases or decreases (not quantitative values) are taken as evidence of similarity across systems. It seems that a more rigorous and standard approach would be to use an ANOVA/MANOVA with a suitable post-hoc test to perform all of these.

      (6) It is unclear how a lack of vinculin dimerization at adherens junctions perturbs epithelial migration, but the complete lack of the vinculin tail, which also cannot dimerize, does not. In other words, how can TL "have no other role in cell migration at confluence than those at FAs as in single cells"? Notably, the authors do not include the tailless variant in the schematic model figures. These results should be included and explained.
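
      To make the omnibus comparison suggested in point (5) concrete, the sketch below computes the one-way ANOVA F statistic across several conditions in plain Python. It is an illustration only: the function name and the numbers in the usage note are hypothetical, not data from the manuscript. Obtaining a p-value and running a post-hoc test such as Tukey's HSD would require a statistics library (for example `scipy.stats.f_oneway` and, in recent SciPy versions, `scipy.stats.tukey_hsd`).

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA.

    `groups` is a list of lists, one list of measurements per condition
    (e.g. migration speeds for each vinculin construct).
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    # Variation of the individual measurements around their group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, means) for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within
```

      For example, `one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])` tests all three made-up conditions at once instead of comparing each one only to a control.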

    5. Author response:

      [Editors' note: The authors included an author response to reviews from another journal]

      Reviewer #1 (Comments to the Authors):

      In this manuscript the authors describe that cells in collective movements adopt a superdiffusive behavior to outpace individual cells. This behavior is regulated by cell-cell junctional stability and force transmission. The authors state that speed is regulated by vinculin through mechanosensitivity.

      While it makes intuitive sense that cells may move more efficiently collectively as it reduces their exploratory space and therefore increases their efficiency of movement,

      We agree that this is an intuitive explanation. However, previous literature had shown that confluent cells may or may not migrate depending on conditions that do not solely depend on the space available per cell, but also involve the intrinsic activity of the cell, its cortical tension, and its adhesion with its neighbors, with sometimes counterintuitive effects (doi: 10.1016/J.CEB.2021.07.011). This was the reason that motivated us to investigate how these various ingredients affected space exploration efficiency on different time scales.

      Our results indeed refute the intuition that cells move more efficiently when their exploratory space is reduced by showing that the outcome depends on the time scale considered (Fig. S3B). Specifically, on short time scales (less than 3 hours), the area explored by individual MDCK cells is larger than that explored by MDCK cells at confluence. On a longer time scale (greater than 3 hours), however, the area explored by confluent MDCK cells is larger. This switch is a direct consequence of the change in migratory behavior from persistent random walk to superdiffusion. Moreover, its position in time depends on the cell line: extrapolation of our results on RPE-1 cells suggests that it should theoretically occur after approximately 300 hours, if this time scale were experimentally accessible (Fig. S3F).

      …the role of junctions specifically is less clear.

      We are sorry that we were not able to clearly convey the roles of junctions. We have substantially rewritten our text to address this and all the changes are highlighted in orange. As summarized in Fig. 6F, junctions have three roles. The first role is on persistence, through velocity coordination between neighbors, the second is on speed, through the stability of junctions, and the third role is on directionality, through the sensitivity of the monolayer to the wound edge.

      The first role is evidenced thanks to the comparison of the MSD between single cell and confluent migration assays and the use of the alpha-catenin KD cell line. Alpha-catenin depletion is known to be the most potent disruptor of adherens junctions (DOI: 10.1091/mbc.e06-05-0471, DOI: 10.1126/science.aaf7119, DOI: 10.1073/pnas.1002662107, DOI: 10.1073/pnas.1119313109), and we show that it significantly alters the superdiffusive behavior that emerges in the confluent migration assay (Fig. 3E,F, 5C). Therefore, junction integrity is critical for the control of cell persistence.

      Moreover, alpha-catenin depletion induces a loss of velocity coordination between neighbors (Fig. S3E), which we show through numerical simulations to induce superdiffusion (Fig. 3G). By contrast, E-cadherin KO and vinculin mutants have no effect on the superdiffusion of confluent cells (Fig. 3E, 4A). Therefore, the critical molecular ingredient is the link provided by alpha-catenin to the cytoskeleton that provides junction integrity.

      The second role of junctions is evidenced thanks to the comparison of cell speeds between single and confluent migration assays with the vinculin mutants (Fig. S4A). Results show that cell speed is reduced by about 10 µm/h at confluence, regardless of the mutant except for YE, whose only difference with other mutants is its lower stability (Fig. 4F). This supports that junction stability, and not the other effects of mutants, controls cell speed (we provide a detailed demonstration in the response to the following question). As expected, junction integrity is required as well, as seen from the higher cell speed of the alpha-catenin KD cell line compared to WT (first MSD point in Fig. 3B, E).

      The third role of junctions is evidenced thanks to the comparison between confluent and directed migration assays (Fig. 6A). Results show that the wound healing rate is proportional to cell speed at confluence, regardless of the mutant except for YE, which displays no tension gradient in junctions from front to back cells (Fig. 6C). This supports that such a gradient is required for cells to identify on which side the wound edge lies. As expected, junction integrity is required as well, as seen from the loss of directional bias of the alpha-catenin KD cell line (Fig. 5F).

      The authors chose vinculin as the basis by which to manipulate tensions at cell-cell junctions, but this comes with considerable drawbacks. Namely, since vinculin appears at both cell-cell and cell-matrix junctions, its role and the role of its mutations is not clear here. The authors state that the collective migration speed is related to junctional stability, but because vinculin is also at FA, how can this be concluded?

      We apologize for the lack of clarity. We hope that the highlighted changes in the revised manuscript will improve this point. As exemplified above, comparing cell migration between isolated cells and confluent cells is essential to enable us to distinguish between the contributions of AJs and FAs. Indeed, since isolated cells lack AJs, the impact of vinculin mutants on single cell migration can only be explained by their effects on FAs. This is how we first determine the effects of vinculin mutants on migration that depend on FAs. Because confluent cells also have FAs, we expect that the effects of vinculin mutants on the migration of isolated cells will still be present in confluent cells, to which will be added the effects of these mutants on AJs and their consequences on migration, if any.

      Therefore, when compared to WT cells, if a given mutant decreases or increases migration speed in individual cells, and does so in confluent cells in the same proportion, then its effects at confluence can be entirely explained by its effects in individual cells, and there are no additional effects of that mutant from AJs. This is indeed what we observe for all mutants except the YE mutant (Fig. S4C), leading us to conclude that none of the vinculin mutants, except the YE mutant, have an effect on migration at confluence that results from AJs. In contrast, the YE mutant has effects on migration at confluence that cannot be explained by its effect on individual cell migration. Therefore, the effects of YE at confluence depend on AJs, whether they result from alterations in AJs, FAs, or both. To distinguish between these scenarios, we proceed by elimination, comparing the effects of YE to those of other mutants on force transmission and adhesion stability, and how these two factors associate with migration speed, as explained below. In FAs, YE alters force transmission differently in individual cells and at confluence, but we already know from Fig. 2 that force transmission in FAs cannot alone explain the speed of migration. This result rules out an indirect effect of AJs on cell migration at confluence through FAs. Furthermore, in AJs, YE affects stability and force transmission, but TL has the same effect on force transmission as YE and we already know that none of the effects of TL on migration depend on AJs (Fig. 3, S4C). This result rules out an effect of force transmission in AJs on migration speed at confluence. We conclude that stability at the AJ level, which is the remaining property specifically impaired by YE, is what regulates migration speed at confluence.

      The manuscript's logic and flow are not clear in some places, making the story hard to follow. As one example, the FRAP data, which the authors suggest is used to investigate vinculin's combined role does not help in this capacity as the interpretation and its connection to the bigger story are not clear.

      We are sorry again for the lack of clarity. We used FRAP data to evaluate the effects of vinculin mutants on adhesion stability. Indeed, mutants have different effects on adhesion stability (Fig. 2E, 4F). In addition, they also have different effects on force transmission (Fig. 2D, 4D,E). The partial overlap in functional alterations caused by the mutants allows us to test the involvement of the overlapping function (here stability) in the overall migration outcome. For example, if two mutants have a similar effect on adhesion stability but different effects on migration speed (such as TL and T12), we can then rule out that speed results from adhesion stability. Similarly, if two mutants have different effects on stability but a similar effect on speed (such as TL and YE), we can also rule out that speed results from stability. We applied the same reasoning to force transmission to conclude that neither adhesion stability nor force transmission alone is sufficient for cells to migrate rapidly. However, the combination of the two enables rapid migration.

      As another example, the information derived from the use of the mutants is not clear in the context of the manuscript's message, since they affect both cell-cell and cell-matrix junctions and in some places show results that are counterintuitive and not well explained; the authors admit that these results are surprising but then do not explain their meaning.

      As such, it is very hard to follow the logic with regard to the information resulting from the mutant experiments.

      We provide above a detailed break-down of our strategy to analyze the results. We regret that our manuscript did not adequately convey our conclusions and we hope that the new version of the manuscript improves this point.

      Proliferation has been shown to play a role in wound healing. Does proliferation change in the various conditions?

      This is an important point. The average speed of cells at confluence is approximately 20 µm/h (Fig. 4B), which means that each cell moves approximately its own size in one hour. During this time, assuming a 16-hour cell cycle, 6% of the cells would have divided, each of them likely pushing one of its neighbors a distance equivalent to the size of a cell. Therefore, cell proliferation accounts for at most a few percent of the total cell movement. For this reason, we can assume that growth does not account for a large part of the movement we observe. This is consistent with previous work showing that proliferation does not contribute significantly to wound healing (DOI: 10.1073/pnas.0705062104, DOI: 10.1083/jcb.201207148).
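
      The order-of-magnitude estimate above can be reproduced with a few lines of arithmetic. The sketch below is only an illustration: the function name is hypothetical, the cell size of 20 µm is an assumption taken from the statement that a cell moves about its own size in one hour at 20 µm/h, and divisions are assumed to be spread evenly over the cell cycle.

```python
def proliferation_share(speed_um_per_h, cell_size_um, cell_cycle_h,
                        window_h=1.0):
    """Estimate how much of the mean displacement could come from division.

    Returns (fraction of cells dividing in the window,
             mean push displacement / migration displacement).
    """
    migration_um = speed_um_per_h * window_h
    # Assumes asynchronous divisions spread evenly over the cell cycle.
    dividing_fraction = window_h / cell_cycle_h
    # Each division is taken to push one neighbor by one cell size.
    mean_push_um = dividing_fraction * cell_size_um
    return dividing_fraction, mean_push_um / migration_um


fraction, share = proliferation_share(
    speed_um_per_h=20.0,  # average speed at confluence (Fig. 4B)
    cell_size_um=20.0,    # assumed: about one cell size per hour
    cell_cycle_h=16.0,    # cell cycle length assumed in the text
)
print(f"dividing per hour: {fraction:.1%}, share of movement: {share:.1%}")
```

      With these inputs both quantities are about 6%, in line with the statement that proliferation accounts for at most a few percent of the total cell movement.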

      Minor comments:

      The authors should provide a better description of the mutants: what does a tailless mutant not bind, or bind differently? More context is needed to help interpret the results. While the mutants have all been published on before, it would be helpful to have more information here so that the manuscript is easier to follow.

      We are sorry that the information we provided was insufficient. We have now detailed the mutations to help the reader understand how interactions are altered.

      Figure 1A is not necessary. Figure 1 overall is fairly predictable as there have been many papers using the persistent random walk as the best model for single cell migration (dating back to the early 1990s). The authors define a new term, angular memory, which they show decreases with increasing delta t as one would predict.

      We acknowledge that persistent random walks have already been observed for individual cells, as in references 3-4 cited in the introduction. Nevertheless, we believe that Figure 1 is important because not all cells migrate as persistent random walkers when isolated. Some migrate in a more exotic manner, resulting in superdiffusive behavior, as in references 5-8 cited in the introduction. Since we observe superdiffusive behavior at confluence (Figure 2), it was therefore necessary to show whether or not single cells were superdiffusive too. We also use this figure to introduce angular memory, a measure that, to our knowledge, has never been used before. As intuition suggests, it decreases to 0 for persistent random walkers, just as a similar measure, velocity autocorrelation, would do. However, the angular memory of fractional Brownian walkers does not vanish with increasing delta t (Fig. 3D), whereas their velocity correlation would vanish, just like that of persistent random walkers. This difference makes angular memory much more appropriate for distinguishing between the two migration behaviors, and prompted us to introduce it in the first figure as a reference.

      In the wound healing assay, which cells were measured? Leading edge or interior, and does it matter?

      Figure 5A shows that cells behave differently depending on their distance from the wound. This is because the traces shown correspond to the first few hours of the movie, during which the cells at the front begin to move first. Figure S5A shows the speed of the cells over time after the wound and indicates that the cells reach a stable speed after approximately 3 to 4 hours. Figure S5B shows the speed of the cells as a function of distance from the wound at steady state. These results show that the speed of the cells no longer depends on the distance from the wound at this stage. As indicated in the “Materials and Methods” section, we only considered time points beyond this stage for subsequent analyses of population-averaged MSD and velocity presented in Figure 5, so the location of cells at the front or rear was irrelevant.

      Reviewer #2 (Comments to the Authors):

      To migrate, cells must spatially explore their environments, a process that is guided by intrinsic signals (adhesive and mechanical properties, etc.) and extrinsic signals (gradient cues). This exploration can occur on the single or multicellular level. In this study, the authors examine the effect of cell-cell interactions, guidance cues, and cell mechanics on the exploratory capacity of MDCK cells. The authors show that cell-cell adhesion provides an "infinite directional memory for migration" and that cell speed is dependent upon the focal adhesion stability, cell mechanics, and the mobility of adherens junctions; these processes are modulated by vinculin.

      My three major concerns with the manuscript are as follows:

      (1) While there is potential new information about the role cell-cell junctions and guidance cues play in cell migration, there is not enough NEW insight presented. Rather the role of vinculin in these processes is expected given what is already known about its ability to control focal adhesion stability, mechanics, and adherens junctions.

      We agree that our cell migration results make sense based on the effects of vinculin mutants on the stability and force transmission of adhesions. Nevertheless, we argue that this was not the only possible scenario. Indeed, we find that none of the effects of vinculin mutants on AJs (except YE) have an impact on cell migration (Fig. S4C). One might have expected that the increased stability provided by the TL and T12 mutants would reduce the speed of collective cell migration, just as the YE mutant increased cell speed due to its altered stability. This is not what we found, and this reveals a nonlinear relationship between AJ stability and migration speed that could be investigated more thoroughly in future studies. Another example is that the effects of the mutants on force transmission in AJs do not impact migration speed at confluence but do impact directed collective migration (Fig. 6). One might have expected that vinculin-mediated force transmission in AJs would impact collective migration, whether directed or not.

      More importantly, we show that the role of intercellular adhesion in cell migration is more complex than expected. Indeed, it depends on the timescale considered: intercellular adhesion is detrimental to short-term spatial exploration and beneficial in the long term (Fig. S3B). Such a timescale-dependent behavior is impossible to predict from previously known effects of the mutants or other molecular considerations. Furthermore, we show that this behavior can be fully explained by the coordination of velocities between neighbors, which depends on intact connections between AJs and the cytoskeleton via alpha-catenin, but is independent of vinculin mutants that connect AJs to the cytoskeleton in parallel with alpha-catenin. One might have expected these connections to also have an impact on velocity coordination, and thus on spatial exploration, but we show that this is not the case (Fig. 3). Finally, we show that directed collective migration has a negligible impact on cell exploration at our experimental timescale (Fig. 5), whereas we initially expected the wound to make migration more ballistic. This reveals that such a directional signal affects spatial exploration at much longer timescales than expected.

      Overall, our results quantify the outcome of competing effects and provide timescales at which one effect outweighs the other in influencing cell migration. We believe this is an original approach that provides substantial new insights into collective cell migration.

      (2) The phenotypes of the cells expressing the mutant vinculins vary greatly. These phenotypes are not addressed despite the fact that they could potentially complicate the analyses. For example, there are dramatic differences between focal adhesion numbers and sizes in the cells expressing the different vinculin mutants; cell spreading is also dramatically altered. Likewise, the T12 mutant vinculin has previously been reported to have increased adhesive strength, increased traction forces, and cell spreading. How does this knowledge change the interpretation?

      We agree that vinculin mutants may have effects on the size and number of FAs, cell spreading, and traction forces that we do not examine here. These consequences can be explained by the effects of these mutants on force transmission in FAs and on their stability, which we report in our work. They do not affect our interpretations. Here, we provide a predictive model of migration speed based on the combination of two consequences of vinculin function, namely stability and force transmission. An interesting avenue for future research would be to assess whether these combinations can be reduced to a single higher-level effect of vinculin on the cellular phenotype that would be sufficient to predict migration speed. This work remains to be done, as neither FA size and number, cell spreading, adhesion force, nor traction forces alone are sufficient to predict migration speed.

      Along the same lines, it has previously been established that tagged versions of vinculin do not efficiently integrate into adherens junctions. Published work from the Nelson laboratory suggests that GFP-vinculins do not localize to cell-cell junctions, and work from other laboratories suggests that localization occurs only when the endogenous vinculin is silenced.

      We are aware that some GFP-vinculin constructs may not localize as well as the endogenous protein at AJs. This is due to the localization of the GFP tag on the head of vinculin and depends on the length of the linker between GFP and the head of vinculin. The longer the linker, the easier the interaction with AJ partners. Unlike these constructs, the vinculinTSMod sensors we use in our work do not carry a GFP on the head and do not suffer from the same limitations.

      Furthermore, vinculin recruitment to AJs depends on force, with little or no recruitment when tension on the AJs is relaxed (DOI: 10.1038/ncb2055). Vinculin recruitment has in fact already been used as an indicator of AJ tension in Drosophila (DOI: 10.1038/s41467-018-07448-8). Consequently, the amount of vinculin visible at the AJs varies depending on the tension exerted on the AJs, which our results confirm: vinculin is more difficult to detect at the AJs in cells located at the front of a wound than in those located at the back (Fig. 6B), which is consistent with the difference in vinculin tension between front and back cells (Fig. 6C) and with the E-cadherin tension gradient between front and back cells (DOI: 10.1083/jcb.201706013). Overall, these results show that vinculin is not always easy to detect at AJs, but this is due to the properties of vinculin, which the constructs we use reproduce better than previous constructs (see also below).

      The images in figure S2 and the prebleach images in figure S4 do not show convincing localization of the mutant vinculins to cell-cell adhesions. This is of special concern given that YE mutant protein hardly has any discernable localization to cell-cell junctions; additionally, none of the mutant proteins were tested for their ability to co-localize with adherens junction components. This raises the question if the parameters being examined and the conclusions drawn from them are affected by a difference in localization.

      We agree that the recruitment of vinculin at intercellular contacts may be difficult to see.

      Besides the force-dependent effects mentioned above, other factors are involved. The images shown in Figures S2 and S4 are from live cells in which cytoplasmic vinculin is still present, and its level is proportional to the mobility of vinculin. Indeed, the TL and T12 mutants show a more marked contrast between intercellular contacts and the cytoplasm, which is consistent with their greater stability at AJs (Fig. 4F). Conversely, YE shows lower contrast, which is consistent with the lower stability of this construct at AJs (Fig. 4F). The FL construct lies between the two. As a result, the cytoplasmic content can variably mask vinculin recruitment at the AJs depending on the mutant.

      We have now performed additional quantifications of mutant recruitment at intercellular contacts as a function of distance from the basal surface of the cells and relative to their recruitment in FAs, in live cells. Results are shown in the new Fig. S4F. We find that all the constructs are recruited to intercellular contacts with a density that is at most half of that in FAs and that varies along the height. FL shows the highest density, localized more apically, consistent with the localization of an AJ-bound actin belt. The mutants appear to be more homogenously distributed along the height of the lateral surface, which may be explained by their impaired autoinhibition (TL, T12), or mechanosensitivity (YE). This variability also contributes to the difficulty in seeing vinculin recruitment in all cells in a single z-slice.

      To confirm the proper recruitment of vinculin constructs to AJs we have performed immunofluorescence against alpha-catenin and phalloidin on each of the stable cell lines. Results are shown in the new Fig. S4D and E. In these experiments, cell permeabilization allows for the release of some of the cytoplasmic pool of vinculin, which highlights the recruitment of all vinculin constructs to intercellular contacts. There, all vinculin constructs colocalize with alpha-catenin and F-actin, as expected. Additionally, images displayed are maximum intensity projections to mitigate recruitment variability along the height.

      Overall, our results clearly support the localization of vinculin at intercellular contacts, and the differences between the constructs are consistent with the effects of their mutations.

      (3) There is a lack of new mechanistic insight. Conclusions are made about a role of vinculin dimerization. This conclusion appears to be based upon the usage of the mutant version of vinculin Y1065. Did the authors directly measure the ability of this mutant protein to dimerize? Is actin binding also affected?

      The binding properties of the Y1065E mutant, including its dimerization and binding to actin, have already been characterized by other researchers (ref. 40 in our manuscript, as well as DOI: 10.1111/j.1432-1033.1997.01136.x or DOI: 10.1016/j.febslet.2013.02.042). We assumed that these properties are now well established and can be used to explain higher-level phenotypes that we show for the first time, to our knowledge.

      Reviewer #3 (Comments to the Authors):

      Canever et al. tracked two epithelial cell lines on collagen-coated glass and showed that isolated (non-confluent) cells move as persistent random walkers, whereas confluent monolayers migrate superdiffusively, with long-range directional memory. By systematically perturbing the adhesion machinery, they found that focal adhesion mutations mainly tune the speed of single-cell tracks but cannot create long-range memory, while force-bearing adherens junctions are essential for the superdiffusive regime: genetically perturbing them collapses collective memory. These interesting results identify junctional tension as important for switching epithelial cells/sheets between individual and collective search modes, an important quantitative insight of clear relevance to cell biologists.
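
      As an illustration of the distinction drawn above between diffusive and superdiffusive migration, the following minimal sketch estimates the scaling exponent of the mean squared displacement (MSD) from a single 2D track. It is not the authors' analysis pipeline: the function names `msd` and `msd_exponent`, the choice of lags, and the two synthetic tracks are assumptions made for the example. An exponent near 1 indicates diffusive motion, an exponent near 2 ballistic motion, and values in between superdiffusion.

```python
import math
import random


def msd(track, lag):
    """Time-averaged mean squared displacement of one 2D track at a lag (in frames)."""
    n = len(track) - lag
    if n <= 0:
        raise ValueError("lag must be shorter than the track")
    total = 0.0
    for i in range(n):
        dx = track[i + lag][0] - track[i][0]
        dy = track[i + lag][1] - track[i][1]
        total += dx * dx + dy * dy
    return total / n


def msd_exponent(track, lags):
    """Least-squares slope of log(MSD) against log(lag)."""
    xs = [math.log(lag) for lag in lags]
    ys = [math.log(msd(track, lag)) for lag in lags]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


# Synthetic ballistic track: constant velocity along x (hypothetical data).
straight = [(float(t), 0.0) for t in range(200)]

# Synthetic diffusive track: independent Gaussian steps (hypothetical data).
rng = random.Random(0)
x = y = 0.0
diffusive = [(x, y)]
for _ in range(20000):
    x += rng.gauss(0.0, 1.0)
    y += rng.gauss(0.0, 1.0)
    diffusive.append((x, y))

lags = [1, 2, 4, 8, 16]
alpha_straight = msd_exponent(straight, lags)    # ballistic: exponent of 2
alpha_diffusive = msd_exponent(diffusive, lags)  # diffusive: exponent close to 1
```

      With real tracks, the range of lags over which the slope is fitted matters, since a persistent random walk looks ballistic at short lags and diffusive at long ones.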

      - The presented data are nicely quantitative and convincing, but I have subtle concerns about the generality of the findings. While the authors show that the differential behavior they describe is not cell-line specific (MDCK, RPE), there are no experiments evaluating the generality of their conclusions across different matrix conditions. How are the measured migration parameters affected by matrix stiffness? Cell migration on collagen-coated glass coverslips is a relatively narrow and artificial condition. How is the collective directional memory expected to behave on softer substrates? The generality of the conclusions could be strengthened by repeating measurements using hydrogels of varying stiffness. Further, it should be discussed to which tissues in the body the selected matrix conditions and migration modes plausibly apply.

      We agree that the generality of our results and the relevance of glass-rigid substrates is an important point. In vivo, epithelial cells rest on a basement membrane with a typical stiffness of approximately 10 MPa, as demonstrated by experimental evaluations on various tissue explants, including renal glomeruli and Bruch's membrane, which are relevant to MDCK and RPE-1 cells (DOI: 10.1111/j.1742-4658.2007.05823.x, DOI: 10.1172/JCI106898, DOI: 10.1038/eye.1987.35). We have added these references to the manuscript to support our experimental strategy. In vitro, the most significant effects of substrate stiffness on FAs and cell migration generally occur at much lower stiffnesses, between 0.2 and 100 kPa, and cell phenotypes generally plateau at levels comparable to those observed on glass, even below 100 kPa (DOI: 10.1242/jcs.133645, DOI: 10.1038/ncb3268, DOI: 10.1039/c5ib00307e, DOI: 10.1039/c9sm01893j). Furthermore, substrate stiffness has much more moderate effects on confluent cells than on isolated cells. For example, it has been previously demonstrated that confluent layers of MCF10A epithelium showed no change in velocity coordination in the range of 3 to 65 kPa (DOI: 10.1083/jcb.201207148). Therefore, collagen-coated glass appears to be a reasonable model for the basement membrane. Overall, we believe that we have conducted our experiments under physiological conditions, and that our results apply to a wide range of substrate stiffnesses.

      - It would be nice to see how long it takes confluent cell layers to close rectangular wounds of defined size when cells migrate as individual (adherens junctions perturbation) versus collective (wt) (on substrates of different stiffness). Presumably, there should be faster wound closure under the collective regime, at least for simple shaped wounds.

      This is an interesting question, which our results indirectly address. In our study, we measured the wound healing speed of the WT MDCK cell line as well as lines expressing mutant vinculin constructs (Fig. 6A). These results show that this speed ranges from 5 to 15 µm/h depending on the construct expressed (and for reasons that we explain in the manuscript). These values make it easy to estimate the time required to close a wound based on its width. For example, it would take 5 hours to close a 100 µm wide wound for the WT cell line, which has a rate of 10 µm/h (on both sides of the wound).
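
      The estimate above follows from dividing the wound width by the combined speed of the two advancing fronts. A minimal sketch of that arithmetic is given below; the function name `closure_time_hours` and the assumption of a constant front speed on both sides are illustrative simplifications, not part of the manuscript.

```python
def closure_time_hours(width_um, front_speed_um_per_h, fronts=2):
    """Time to close a straight wound when each front advances at a constant speed."""
    if front_speed_um_per_h <= 0 or fronts <= 0:
        raise ValueError("speed and number of fronts must be positive")
    return width_um / (fronts * front_speed_um_per_h)


# WT example from the text: 100 µm wound, 10 µm/h per front, two fronts.
wt_hours = closure_time_hours(100, 10)       # 5.0

# Bounds of the reported speed range (5 to 15 µm/h) for the same wound width.
slowest_hours = closure_time_hours(100, 5)   # 10.0
fastest_hours = closure_time_hours(100, 15)  # about 3.3
```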

      Wound closure for cells with disrupted adhesive junctions has already been documented (DOI: 10.1083/jcb.200910041). The results show that wound closure is indeed slower than with WT cells. Although this previous study does not reveal the underlying causes, our work now shows that there are two factors: weaker directional memory due to impaired intercellular coordination and, in the longer term, an additional lack of sensitivity to the guidance signal provided by the wound.

      - Akin to substrate stiffness variation, I am missing experiments that test the effect of cytoskeletal tension on these migration modes. Experiments with Rho kinase or myosin inhibitors could meaningfully broaden the scope of this study.

      Rho kinase or myosin inhibitors applied to cells during the time required to assess migration patterns (a movie recorded overnight is necessary to obtain a statistically reliable calculation of MSD over 3 to 4 hours) are likely to affect many other cellular processes in addition to the cytoskeletal tension directly involved in migration. We believe that the accumulation of these effects will make interpretation of the results very difficult. For example, it has been shown that inhibition of ROCK by Y27 promotes healing of corneal endothelial lesions by affecting proliferation through cyclin D and p27 (DOI: 10.1167/iovs.13-12225), or by improving respiration, which would provide the energy necessary for migration (DOI: 10.1096/fj.202101442RR). Consistently, another study on HaCaT epidermal cells confirms that myosin phosphatase accelerates wound healing through proliferation (DOI: 10.1016/j.bbadis.2018.07.013). In contrast, in HUVEC cells, ROCK inhibition significantly impaired the proliferation and migration of vascular endothelial cells in vitro in a dose-dependent manner (DOI: 10.1097/ICO.0000000000000493).

      Furthermore, previous studies have highlighted that differential contractility at the subcellular level is important for collective migration (DOI: 10.1038/ncb2133, DOI: 10.1083/jcb.201706013), which is not possible to examine with global activation or inhibition of contractility. This prompts the development of more refined and specific measurement and disruption strategies to assess the respective impact of cytoskeletal tension on cell-cell and cell-matrix adhesion mechanisms. Our work, which uses biosensors to assess how this tension differentially affects cell-cell and cell-matrix adhesions, is a step in this direction. The localized spatio-temporal activation or inhibition of myosin subtypes or Rho GTPase regulators specific to these adhesion structures will likely answer these questions in the future, but we believe that the development and application of these approaches will require a substantial amount of work that goes beyond the scope of our study.

    1. eLife Assessment

      This important study fills a major geographic and temporal gap in understanding Paleocene mammal evolution in Asia and proposes an intriguing "brawn before bite" hypothesis grounded in diverse analytical approaches. The work rests on a solid methodological base. Some limitations remain, including uncertainty introduced by pooling different tooth positions, limited dietary interpretation, and the predominantly herbivorous taxonomic focus, which narrows the ecological scope of the conclusions. However, the manuscript provides a substantially strengthened and well-supported contribution, while appropriately inviting further work to clarify dietary trends, broader ecological context, and links between dental trait evolution and environmental change.

    2. Reviewer #2 (Public review):

      [Editors' note: this version has been assessed by the Senior Editor without further input from the original reviewers. The authors have addressed the minor comments raised in the previous round of review.]

      Summary:

      This study uses dental traits of a large sample of Chinese mammals to track evolutionary patterns through the Paleocene. It presents and argues for a 'brawn before bite' hypothesis -- mammals increased in body size disparity before evolving more specialized or adapted dentitions. The study makes use of an impressive array of analyses, including dental topographic, finite element, and integration analyses, which help to provide a unique insight into mammalian evolutionary patterns.

      Strengths:

      This paper helps to fill in a major gap in our knowledge of Paleocene mammal patterns in Asia, which is especially important because of the diversification of placentals at that time. The total sample of teeth is impressive and required considerable effort for scanning and analyzing. And there is a wealth of results for DTA, FEA, and integration analyses. Further, some of the results are especially interesting, such as the novel 'brawn before bite' hypothesis and the possible link between shifts in dental traits and arid environments in the Late Paleocene. Overall, I enjoyed reading the paper and I think the results will be of interest to a broad audience.

      Weaknesses:

      For the original draft of the manuscript, I had four major concerns with the study, especially related to the sampling, diet, and evidence for the 'brawn before bite' hypothesis. I still believe that the original issues that I raised may be weaknesses of the study. For example, there is still limited discussion on diets (even though the dental topographic analyses used in the study are designed for inferring diets). And I find the results a little challenging to interpret because teeth of multiple positions are included in the same samples, which seems problematic. That said, the authors have addressed each of my previous concerns and have made major revisions, including running new analyses, and thus I support the paper.

    3. Author response:

      The following is the authors’ response to the previous reviews

      Public Reviews:

      Reviewer #2 (Public review):

      Summary:

      This study uses dental traits of a large sample of Chinese mammals to track evolutionary patterns through the Paleocene. It presents and argues for a 'brawn before bite' hypothesis -- mammals increased in body size disparity before evolving more specialized or adapted dentitions. The study makes use of an impressive array of analyses, including dental topographic, finite element, and integration analyses, which help to provide a unique insight into mammalian evolutionary patterns.

      Strengths:

      This paper helps to fill in a major gap in our knowledge of Paleocene mammal patterns in Asia, which is especially important because of the diversification of placentals at that time. The total sample of teeth is impressive and required considerable effort for scanning and analyzing. And there is a wealth of results for DTA, FEA, and integration analyses. Further, some of the results are especially interesting, such as the novel 'brawn before bite' hypothesis and the possible link between shifts in dental traits and arid environments in the Late Paleocene. Overall, I enjoyed reading the paper and I think the results will be of interest to a broad audience.

      Weaknesses:

      For the original draft of the manuscript, I had four major concerns with the study, especially related to the sampling, diet, and evidence for the 'brawn before bite' hypothesis. I still believe that the original issues that I raised may be weaknesses of the study. For example, there is still limited discussion on diets (even though the dental topographic analyses used in the study are designed for inferring diets). And I find the results a little challenging to interpret because teeth of multiple positions are included in the same samples, which seems problematic. That said, the authors have addressed each of my previous concerns and have made major revisions, including running new analyses, and thus I support the paper.

      This revised submission includes only minor changes aimed at clarifying the main text.

      Reviewer #2 (Recommendations for the authors):

      I appreciate that the authors made many improvements to their study based on reviewers' comments. I don't have any remaining major issues with the paper, but I do have several minor comments.

      Thank you for taking the time to provide additional helpful feedback on our study. We have made minor revisions to the manuscript based on your suggestions. Please see our point-by-point response below.

      Lines 48-50. I reiterate my suggestion in my previous review to explicitly state which clade is being discussed, which is important because several major mammal groups beyond placentals (metatherians, multituberculates, dryolestoids, gondwanatherians) survived the K-Pg and had very different diversification patterns. You mention "mammal taxonomic diversity" but in the next sentence say "This initial placental mammals diversification ..." and later mention "stem placental/eutherian lineages." To stay consistent, you might replace "mammal" (L48) and "placental mammals" (L50) with "eutherian(s)" (usually defined as stem + crown placentals). If you follow this suggestion, then elsewhere in the paper I recommend replacing "mammals" with "eutherians" for consistency.

      Thank you for this suggestion. We now use “mammals” throughout the text only for general references to the group; specific mentions of the dataset analyzed are revised to “eutherians.”

      Lines 75-83. I respect the authors' hesitancy to reconstruct specific diets for the fossil taxa (L75-83), especially considering that dental topographic analyses (DTAs) often struggle to differentiate diets in extant taxa (e.g., Pineda-Munoz et al. 2016 Methods Ecol Evol). I still think that the authors might be able to interpret dietary trends from their results (e.g., an increase in average OPCR values indicating a shift toward more herbivorous diets) - I think discussing dietary trends would be an interesting discussion topic later in the paper. That said, I also recognize that different DTA results seem to show conflicting dietary trends (based on my limited knowledge of those metrics) so maybe that complicates things too much.

      We concur with Reviewer 2 that dietary inferences of DTA data are premature, especially given the ongoing controversies of its use in studies of extant mammal teeth. We kept our current scope of discussion unchanged.

      Lines 75-77. "early mammals ... are beyond the reach of conventional phylogenetic bracketing approaches to dietary reconstruction." But your fossils (eutherians) are certainly within 'phylogenetic brackets' of modern clades (therians, i.e. Eutheria + Metatheria). Maybe you're alluding to the fossils being stem lineages of extant subgroups like Ungulata, which means we can't bracket them specifically within those eutherian subgroups? So, I recommend revising or expanding your statement for clarity. Also, the considerable phylogenetic uncertainty for Paleocene groups (e.g., Halliday et al. 2015) complicates this issue, which you could mention.

      We modified the sentence to now say “Additional complications with ecomorphological analysis of these stem eutherians include the uncertainty in their dietary ecology, having diverged prior to the crown radiation, and uncertainty in phylogenetic positions of Paleocene taxa [7]; thus, they are beyond the reach of conventional phylogenetic bracketing approaches to dietary reconstruction.”

      Line 84. "We investigated dental topography-performance shifts ...". You haven't introduced dental topography or even mentioned teeth yet, and "performance shifts" is vague. So, this phrase might confuse readers. Maybe you can just erase it and start the sentence with "We investigated the timing of ecomorphological ..."?

      We made the recommended revision.

      Lines 104-105 (and elsewhere). "Dental traits paralleled Paleocene global and regional environmental conditions" and "We found that dental topographic trait variability in Paleocene mammals in south China tracked global and regional climatic changes". These conclusions seem a little too assertive to me. Your sample is grouped into 3 rough time bins (of somewhat uncertain ages) and is from a relatively small geographic range - that seems like very limited information for inferring links between dental patterns and climatic changes, especially global patterns. I think it's worth HYPOTHESIZING that dental traits are linked to environmental/climatic changes (with results like those in Figure 2A & B as evidence to support that hypothesis), but I wouldn't make that claim with any confidence. So, I recommend that you temper your relevant conclusion statements. For example, for Line 105, you could replace "We found ..." with "We posit ..." (L105). I would make similar changes to similar statements throughout the paper (e.g., L243).

      Thank you for this suggestion to temper our phrasing. We edited throughout the text to make our interpretations less assertive.

      Figure 1 (and your response to reviewers). Why was the timescale changed to 65.5 Ma for the K-Pg boundary? The K-Pg is 66 Ma (not 65.5), which is the age you mention in the text (e.g. Pg 3 L39) and is well established in the literature - see recent papers from the Paul Renne lab for a more exact age.

      We revised the figure to have the K-Pg at 66 Ma.

    1. eLife Assessment

      This valuable paper describes the regulation of the association of meiotic chromosome axis proteins on chromosome ends with sub-telomeric elements in budding yeast. The genome-wide analyses of binding of chromosome components as well as chromatin regulators, complemented with the mapping of meiotic DNA double-strand breaks on chromosome ends, provided solid evidence to support the authors' conclusion. The results in the paper are of interest to researchers in meiotic recombination and the structure of genomes and chromosomes.

    2. Reviewer #1 (Public review):

      The revised manuscript includes several useful additions, and I appreciate the efforts to clarify parts of the analysis. The dataset remains valuable. However, several key issues raised previously are not yet fully resolved and continue to limit the clarity of the main conclusions.

      (1) I appreciate that the authors guide the reader to the relevant regions in the analysis of chromosome fusions (Fig. 2b). However, these subtelomeric regions are not clearly visualized, making it difficult to compare fused and unfused profiles, even though the conclusions rely largely on visual inspection of them. A more direct comparison between fused and unfused ends, together with quantitative summaries (e.g., binned Red1 enrichment and comparisons with internal regions), would make this experiment more convincing.

      (2) The SK1/S288c comparison (Fig. 2c) is an excellent approach, but is currently presented just as profiles, which again requires substantial effort from the reader to extract the relevant information. A systematic analysis across all informative chromosome ends (for example, comparing Red1 levels in syntenic regions using binned log2 fold-change) would more directly test the proposed in cis effect (L168) and clarify the contribution and range of Y'-associated effects. Other factors (e.g. distance from chromosome ends) could also be assessed within this framework.

      Related to this, it is unclear if Y' elements themselves exhibit lower Red1 binding than the genome average. Providing the mean Red1 signal per Y' element would clarify this point and may also aid interpretation of the relationship between coding density and Red1 enrichment.
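
      To make the binned comparison suggested in point (2) concrete, here is a minimal sketch of a per-bin log2 fold-change between two position-matched signal tracks. The function name `binned_log2_ratio`, the pseudocount, and the toy values are assumptions for illustration; real Red1 ChIP-seq tracks would first need normalization and alignment of syntenic regions, which this sketch does not attempt.

```python
import math


def binned_log2_ratio(signal_a, signal_b, bin_size, pseudocount=1.0):
    """Per-bin log2(mean A / mean B) for two equal-length, position-matched tracks."""
    if len(signal_a) != len(signal_b):
        raise ValueError("tracks must have the same length")
    if bin_size <= 0:
        raise ValueError("bin_size must be positive")
    ratios = []
    for start in range(0, len(signal_a), bin_size):
        bin_a = signal_a[start:start + bin_size]
        bin_b = signal_b[start:start + bin_size]
        mean_a = sum(bin_a) / len(bin_a)
        mean_b = sum(bin_b) / len(bin_b)
        # The pseudocount avoids division by zero in bins with no signal.
        ratios.append(math.log2((mean_a + pseudocount) / (mean_b + pseudocount)))
    return ratios


# Toy tracks (hypothetical values): strain A is higher in the first bin, lower in the second.
strain_a = [3.0, 3.0, 1.0, 1.0]
strain_b = [1.0, 1.0, 3.0, 3.0]
fold_changes = binned_log2_ratio(strain_a, strain_b, bin_size=2)  # [1.0, -1.0]
```

      Bins with values near zero would then indicate regions where the two strains carry similar signal, which is the pattern expected if enrichment is encoded in cis.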

      (3) The Dot1-Sir3 section is now simpler. However, I still find it difficult to follow the underlying rationale. In particular, it is unclear why a Dot1 function dependent on H3K79 methylation is introduced, given that the data in the previous section suggest H3K79 methylation is dispensable for subtelomeric Red1 depletion. A clearer statement of the authors' working model would be helpful.

    3. Reviewer #2 (Public review):

      Summary:

      In this manuscript, Raghavan and his colleagues sought to identify cis-acting elements and/or protein factors that limit meiotic crossover at chromosome ends. This limitation is important for avoiding chromosome rearrangements and preventing chromosome mis-segregation.

      By comparing axis protein recruitment in the SK1 and S288C backgrounds, which differ in their number and distribution of Y' elements, the authors show that Y' elements have a limited impact on axis protein enrichment. Genetic analyses coupled with ChIP experiments revealed that the differential binding of the Red1 protein in subtelomeric regions requires the methyltransferase Dot1. Interestingly, the lack of Red1 depletion in subtelomeric regions in this mutant does not impact DSB formation. Another surprising finding is that deleting DOT1 has no effect on Red1 loading in the absence of the silencing factor Sir3. Unlike Dot1, Sir3 directly impacts DSB formation, probably by limiting promoter access to Spo11. As now clearly stated in the abstract and the discussion, this explains only a small part of the low levels of DSBs forming in subtelomeric regions and the main mechanisms suppressing crossover close to the ends of chromosomes remain to be deciphered.

      Strengths:

      This work provides intriguing observations, such as the impact of Dot1 and Sir3 on Red1 loading and the uncoupling of Red1 loading and DSB induction in subtelomeric regions.

      The separation of axis protein deposition and DSB induction observed in the absence of Dot1 is interesting because it rules out the possibility that the binding pattern of these proteins is sufficient to explain the low level of DSB in subtelomeric regions.

      The demonstration that Sir3 suppresses the induction of DSBs by limiting the openness of promoters in subtelomeric regions is convincing.

      Weaknesses:

      The section examining the impact of Dot1 and Sir3 remains complex, which is partly inherent to the intricate relationship between Dot1 and Sir3. However, the authors conclude that Dot1 acts independently of its catalytic activity based on the phenotype of the H3K79R mutant. Although this is possible, it is not fully demonstrated, as the H3K79R mutant may exhibit its own phenotype independently of Dot1. Unless the authors test the impact of the catalytically dead mutant Dot1-G401R on axis protein enrichment at subtelomeres, they cannot claim that Dot1 acts independently of its catalytic activity.

      Sir3's impact on DSB induction is compelling, yet it only accounts for a small proportion of DSB depletion in subtelomeric regions. Thus, the main mechanisms suppressing crossover close to the ends of chromosomes remain to be deciphered.

    4. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Meiotic recombination at chromosome ends can be deleterious, and its initiation (the programmed formation of DSBs) has long been known to be suppressed. However, the underlying mechanisms of this suppression remained unclear. A bottleneck has been the repetitive sequences embedded within chromosome ends, which make them challenging to analyze using genomic approaches. The authors addressed this issue by developing a new computational pipeline that reliably maps ChIP-seq reads and other genomic data, enabling exploration of previously inaccessible yet biologically important regions of the genome.

      In budding yeast, chromosome ends (~20 kb) show depletion of axis proteins (Red1 and Hop1) important for recruiting DSB-forming proteins. Using their newly developed pipeline, the authors reanalyzed previously published datasets and data generated in this study, revealing heretofore unseen details at chromosome ends. While axis proteins are depleted at chromosome ends, the meiotic cohesin component Rec8 is not. Y' elements play a crucial role in this suppression. The suppression does not depend on the physical chromosome ends but on cis-acting elements. Dot1 suppresses Red1 recruitment at chromosome ends but promotes it in interior regions. Sir complex renders subtelomeric chromatin inaccessible to the DSB-forming machinery.

      The high-quality data and extensive analyses provide important insights into the mechanisms that suppress meiotic DSB formation at chromosome ends. To fully realise this value, several aspects of data presentation and interpretation should be clarified to ensure that the conclusions are stated with appropriate precision and that remaining future issues are clearly articulated.

      (1) To assess the chromosome fusion effects on overall subtelomeric suppression, the authors should guide readers on how to look at the data presented in Figure 2b-c. Based on the authors' definition of the terminal 20 kb as the suppressed region, SK1 chrIV-R and S288c chrI-L would be affected by the chromosome fusion, if any. In addition, I find it somewhat challenging to draw clear conclusions from inspecting profiles to compare subtelomeric and internal regions. Perhaps applying a quantitative approach, such as a bootstrap-based analysis similar to those presented earlier, would be easier to interpret.

      The reviewer is correct that we could not simply fuse two ends but had to create translocations that also removed variable amounts of subtelomeric sequence. Targeted translocations require unique sequences, and thus the extent to which telomeric sequences were deleted varied based on the availability of such sequences. As noted by the reviewer this necessarily limits the conclusions that can be drawn. We have expanded the description of this experiment and also explicitly state the limitations of this assay. To improve clarity, we have also included a schematic to better highlight which chromosomal sequences were removed.

      To further probe our finding that subtelomeric axis protein enrichment may largely be encoded in cis, we now compared axis protein enrichment between S288c and SK1, as suggested by reviewer 2. For this analysis, we took advantage of a dataset we had produced previously that measures Red1 enrichment in SK1/S288c hybrid strains, which provide a powerful internally controlled setup that eliminates effects caused by differential timing and synchrony between samples. As now shown in Supplementary Fig. 5, SK1 and S288c differ substantially in their subtelomeric architecture at many ends, including extensive differences in the number and distribution of Y’ elements. Importantly, axis protein distribution was very consistent between SK1 and S288c when correcting for the differences in length of individual chromosome ends, supporting the conclusion that axis protein enrichment levels are primarily encoded in cis. This analysis is now shown in Fig. 2c. These data also indicate that the presence of a Y’ element does not affect axis protein levels beyond displacing the axis-recruiting sequences further into the chromosome interior.

      (2) The relationship between coding density and Red1 signal needs clarification. An important conclusion from Figure 3 is that the subtelomeric depletion of Red1 primarily reflects suppression of the Rec8-dependent recruitment pathway, whereas Rec8-independent recruitment appears similar between ends and internal regions. Based on the authors' previous papers (references 13, 16), I thought coding (or nucleosome) density primarily influences the Rec8-independent pathway. However, the correlations presented in Figure 2d-e (also implied in Figure 3a) appear opposite to my expectation. Specifically, differences in axis protein binding between chromosome ends and internal regions (or within chromosome ends), where the Rec8-dependent pathway dominates, correlate with coding density. In contrast, no such correlation is evident in rec8Δ cells, where only the Rec8-independent pathway is active and end-specific depletion is absent. One possibility is that masking coding regions within Y' elements influences the correlation analysis. Additional analysis and a clearer explanation would be highly appreciated.

      Thank you for pointing this out. We now also included Y’ elements in the analysis in Fig 2d. Including the Y’ elements yielded an increase in average coding density near the very ends of the chromosomes. This increase matches the higher level of axis protein binding seen in rec8 mutants in Fig. 3a and is consistent with the previously noted link between coding density and axis protein deposition. We now provide further description in the text and the figure legends.

      We do not have an explanation for why there is no correlation with coding density in the EARs but assume that this reflects the unique regulation of this region (as also implied by Supplementary Fig. 4d). At present, the signals that establish the EARs remain unknown although our data indicate that the Hop1-CBR as well as Dot1 are important for axis protein enrichment in the EARs.

      (3) The Dot1-Sir3 section starting from L266 should be clarified. I found this section particularly difficult to follow. It begins by stating that dot1∆ leads to Sir complex spreading, but then moves directly to an analysis of Red1 ChIP in sir3∆ without clearly articulating the underlying hypothesis. I wonder if this analysis is intended to explain the differences observed between dot1∆ and H3K79R mutants in the previous section. I also did not understand the concluding statement that Dot1 counteracts Sir3 activity. As sir3Δ alone does not affect subtelomeric suppression, it is unclear what Dot1 counteracts. Perhaps explicitly stating the authors' working model at the outset of this section would greatly clarify the rationale, results, and conclusions.

      Thank you for this comment. We reworked the introduction to this paragraph to be more focused on Sir3 rather than Dot1. We hope that this introduction is less confusing and more in line with the data presented in this paragraph. We also expanded the conclusion to suggest the alternative possibility that the Sir complex only becomes a regulator of axis proteins in the absence of Dot1.

      Reviewer #2 (Public review):

      Summary:

      In this manuscript, Raghavan and his colleagues sought to identify cis-acting elements and/or protein factors that limit meiotic crossover at chromosome ends. This is important for avoiding chromosome rearrangements and preventing chromosome missegregation.

      By reanalyzing published ChIP datasets, the researchers identified a correlation between low levels of protein axis binding - which are known to modulate homologous recombination - and the presence of cis-acting elements such as the subtelomeric element Y' and low gene density. Genetic analyses coupled with ChIP experiments revealed that the differential binding of the Red1 protein in subtelomeric regions requires the methyltransferase Dot1. Interestingly, Red1 depletion in subtelomeric regions does not impact DSB formation. Another surprising finding is that deleting DOT1 has no effect on Red1 loading in the absence of the silencing factor Sir3. Unlike Dot1, Sir3 directly impacts DSB formation, probably by limiting promoter access to Spo11. However, this explains only a small part of the low levels of DSBs forming in subtelomeric regions.

      Strengths:

      (1) This work provides intriguing observations, such as the impact of Dot1 and Sir3 on Red1 loading and the uncoupling of Red1 loading and DSB induction in subtelomeric regions.

      (2) The separation of axis protein deposition and DSB induction observed in the absence of Dot1 is interesting because it rules out the possibility that the binding pattern of these proteins is sufficient to explain the low level of DSB in subtelomeric regions.

      (3) The demonstration that Sir3 suppresses the induction of DSBs by limiting the openness of promoters in subtelomeric regions is convincing.

      Weaknesses:

      (1) The impact of the cis-encoded signal is not demonstrated. Y' containing subtelomeres behave differently from X-only, but this is only correlative. No compelling manipulation has been performed to test the impact of these elements on protein axis recruitment or DSB formation.

      Thank you for this comment. Our data indeed appeared contradictory because XY’ ends showed overall lower axis protein enrichment, yet our analysis of chromosome fusions, which also eliminated Y’ elements at some of the fused ends, provided no evidence for an effect of Y’ elements at those ends. As also noted in the response to reviewer 1, we now compared axis protein enrichment between S288c and SK1, which differ substantially in their number and distribution of Y’ elements (Supplementary Fig. 5). We found that axis protein distribution and enrichment were very consistent between SK1 and S288c when correcting for the displacement caused by the presence of Y' elements and other subtelomeric sequences (now shown in Fig. 2d). These data support the conclusion that axis protein enrichment levels are primarily encoded in cis and indicate that the presence of Y’ elements does not affect axis protein levels beyond displacing the axis-recruiting sequences further into the chromosome interior (giving rise to the apparently lower axis protein enrichment on XY’ ends).

      (2) The mechanism by which Dot1 and Sir3 impact Red1 loading is missing.

      Although we do not yet understand the precise molecular details of these effects, we nevertheless believe we have obtained several important insights into this mechanism. First, our data indicate that the suppressive effect of the ends primarily impacts the Rec8-dependent loading of Red1, whereas loading via the Hop1-CBR is largely unaffected. The effect of Dot1 thus likely occurs via the Rec8-Red1 interaction. Second, the increase in Red1 recruitment is fully rescued by deletion of Sir3, suggesting that Sir3 becomes a promoter of axis protein recruitment in the absence of Dot1. These dependencies are now outlined in the model in Fig. 9. We would also like to note that the Sir complex was previously shown to impact cohesin in mitotic cells. Thus, a connection between the Sir complex and cohesin is not without precedent.

      (3) Sir3's impact on DSB induction is compelling, yet it only accounts for a small proportion of DSB depletion in subtelomeric regions. Thus, the main mechanisms suppressing crossover close to the ends of chromosomes remain to be deciphered.

      Thank you, we absolutely agree. We had discussed this point in the discussion but now also explicitly state this point in the abstract and expanded the discussion of these findings in the results and discussion.

      Reviewer #3 (Public review):

      Summary:

      The paper by Raghavan et al. describes pathways that suppress the formation of meiotic DNA double-strand breaks (DSBs) for interhomolog recombination at the end of chromosomes. Previously, the authors' group showed that meiotic DSB formation is suppressed in a ~20kb region of the telomeres in S. cerevisiae by suppressing the binding of meiosis-specific axis proteins such as Red1 and Hop1. In this study, by precise genome-wide analysis of binding sites of axis proteins, the authors showed that the binding of Red1 and Hop1 to sub-telomeric regions with X and Y' elements is dependent on Rec8 (cohesin) and/or Hop1's chromatin-binding region (CBR). Furthermore, Dot1 functions in a histone H3K79 trimethylation-independent manner, and the silencing proteins Sir2/3 also regulate the binding of Red1 and Hop1 and also the distribution of DSBs in sub-telomeres.

      Strengths:

      The experiments were conducted with high quality and included nice bioinformatic analyses, and the results were mostly convincing. The text is easy to read.

      Weaknesses:

      The paper did not provide any new mechanistic insights into how DSB formation is suppressed at sub-telomeres.

      We respectfully disagree with this assessment. We show that the Sir complex suppresses DSB formation at a number of cryptic hotspots in the X elements and the adjacent subtelomeric sequences by causing chromatin compaction. The role of the Sir complex in transcriptional silencing, chromatin accessibility, and DSB formation had not previously been analyzed in the meiotic subtelomeres. That being said, Sir-dependent suppression is clearly not the only mechanism that suppresses DSBs in the subtelomeres, as we only observed DSB formation at a small number of hotspots. This was in and of itself a surprise, in particular given the large scale effect on chromatin compaction. We made an effort to more strongly emphasize the fact that additional layers of regulation must exist in the abstract and in the discussion.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      (1) Evidence for cis-acting suppression by Y' elements requires further support. The authors propose that Y' elements act in cis to suppress axis protein association at chromosome ends. While this is an attractive model, the current analyses do not yet provide sufficient support for it.

      Thank you for this comment. Our data indeed appeared contradictory, because XY’ ends showed overall lower axis protein enrichment, yet our analysis of chromosome fusions, which also eliminated Y’ elements at some of the fused ends, provided no evidence for an effect of Y’ elements at those ends. As noted above, we now compared axis protein enrichment between S288c and SK1, which differ substantially in their number and distribution of Y’ elements (Supplementary Fig. 5). We found that axis protein distribution and enrichment were very consistent between SK1 and S288c when correcting for the displacement caused by the presence of Y' elements and other subtelomeric sequences. These data support the conclusion that axis protein enrichment levels are primarily encoded in cis and indicate that the presence of Y’ elements does not affect axis protein levels beyond displacing the axis-recruiting sequences further into the chromosome interior (giving rise to the apparently lower axis protein enrichment on XY’ ends).

      (1a) In Figure S4c, the authors masked Y' elements to rule out the possibility that reduced binding within Y' elements themselves accounts for the overall underrepresentation in subtelomeric regions. However, since the authors propose that Y' elements suppress axis protein binding in surrounding regions in cis, it is appropriate to perform this analysis specifically on chromosome ends containing XY'.

      Thank you for this suggestion. We agree that this would specifically affect the XY’ ends. However, given that we did not see a change even with all the ends, we do not expect a change with just the XY’ ends.

      (1b) In Figure 2b-c, the authors conclude that removal of Y' elements by chromosome fusion does not reveal a long-range suppressive effect. However, the spatial extent of Y'-mediated suppression is not defined, making it unclear whether this experiment can test the proposed model. Perhaps plotting the averaged axis protein profile as a function of distance from Y' elements could help define the effective range of suppression and clarify whether the fusion experiment is informative in this context.

      Thank you. As noted above we now compared SK1 and S288c ends, which provided further evidence that Y’ elements do not affect axis protein enrichment beyond displacing binding sites further into the chromosome interior. In addition, we substantially expanded the description of the chromosome fusion experiment to more clearly outline the setup and the limitations of this experiment.

      (2) L402: "one of the first pieces of direct evidence that nucleosomes block meiotic DSB formation in vivo" sounds overstated, considering past publications (e.g., ref 45 and S. pombe ade6-M26 papers).

      We toned this down and added the references.

      (3) Figure 2e and other scatter plots: Correlation coefficients are reported without p-values. If the authors prefer to use confidence intervals from linear regression instead, they should justify this approach.

      We added p-values to all scatter plots.
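
      For readers unfamiliar with what attaching a p-value to a correlation coefficient involves, the sketch below computes a Pearson coefficient and a two-sided permutation p-value using only the Python standard library. The response does not say which test was used for the scatter plots, so this is one generic option under that assumption; the function names and example data are hypothetical.

```python
import math
import random

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences.

    Assumes neither sequence is constant (otherwise the denominator is zero).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def permutation_p_value(x, y, n_perm=2000, seed=0):
    """Two-sided permutation p-value for the observed |r|.

    Shuffles y repeatedly to break any x-y association and counts how often
    the shuffled |r| is at least as large as the observed |r|.
    """
    rng = random.Random(seed)
    observed = abs(pearson_r(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(pearson_r(x, shuffled)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Hypothetical example: a perfectly linear relationship.
coding_density = [1, 2, 3, 4, 5, 6, 7, 8]
signal = [2 * v + 1 for v in coding_density]
r = pearson_r(coding_density, signal)
p = permutation_p_value(coding_density, signal)
```

      A permutation test makes no normality assumption, at the cost of a resolution limited by the number of permutations.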

      (4) Figure 7f. Explain blue dots.

      We apologize for this oversight (also applies to Supplementary Fig. 10). The blue dots are measurements within 5 kb of an X element. The red dots are the rest of the genome. We now included a legend in the panel to clarify this notation.

      (5) Figure 8d. To assess whether the conclusion can be generalized, the authors could plot the MNase and TrAEL-seq signal fold changes (sir3Δ/SIR3) for hotspots within 5 kb of X elements.

      We attempted various analyses in this direction. However, the range of the MNase-seq effect in sir3 mutants is much greater than the effect on DSBs, making it difficult to make any correlative statements. There are clearly additional layers of DSB suppression in the telomeric regions, and loss/gain of nucleosomes is not sufficient to switch hotspots on/off at most hotspots. We now included a statement to this end in the abstract and also further discuss this notion in the discussion.

      (6) Figure S1c. The apparent difference in X-element distribution may be influenced by bin size. This could be tested by repeating the analysis using smaller bins, comparable in size to the X elements, for all regions.

      We thank the reviewer for this thoughtful suggestion. To address this concern, we repeated the analysis using smaller bins comparable in size to X elements (450 bp) across all region types. Specifically, X elements were analyzed per annotated element, while Y′ elements, subtelomeric 20 kb regions, and internal regions were subdivided into fixed 450 bp windows, and mean input coverage was calculated for each window using the same width-weighted approach.

      This reanalysis did not materially alter the overall distribution patterns observed in Figure S1c. We observed only minor shifts in absolute values, which are expected when changing bin granularity.

      Any residual differences likely reflect underlying copy number of X elements at chromosome ends. Importantly, all ChIP signals in the manuscript are normalized to their corresponding input (ChIP/Input), which mitigates potential biases arising from local copy number variation.
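
      The windowing described in this response can be sketched as follows: a region is cut into fixed 450 bp windows, and the mean coverage of each window weights every coverage interval by the number of base pairs it contributes. This is a minimal illustration, not the authors' code. The interval format (half-open, bedGraph-like tuples), the function names, and the decision to average only over covered bases are assumptions.

```python
def fixed_windows(region_start, region_end, size=450):
    """Split [region_start, region_end) into consecutive windows.

    The final window may be shorter than `size`.
    """
    return [(s, min(s + size, region_end))
            for s in range(region_start, region_end, size)]

def width_weighted_mean(intervals, win_start, win_end):
    """Mean coverage over [win_start, win_end).

    `intervals` holds (start, end, value) tuples with half-open coordinates.
    Each value is weighted by its overlap with the window in bp; bases with
    no interval are ignored (an assumption made for this sketch).
    """
    total = 0.0
    covered = 0
    for start, end, value in intervals:
        overlap = min(end, win_end) - max(start, win_start)
        if overlap > 0:
            total += value * overlap
            covered += overlap
    return total / covered if covered else float("nan")

# Hypothetical input coverage: 300 bp at depth 2, then 600 bp at depth 4.
coverage = [(0, 300, 2.0), (300, 900, 4.0)]
window_means = [width_weighted_mean(coverage, s, e)
                for s, e in fixed_windows(0, 900)]
```

      In this toy case the first window mixes 300 bp at depth 2 with 150 bp at depth 4, so its mean is pulled toward the longer segment rather than being a plain average of the two values.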

      (7) Figure S2. X elements are difficult to find (e.g., chrVII-L).

      We now included arrowheads at locations with full-length X elements. Partial X elements are marked with stars.

      (8) Figure S7. Please indicate the endpoints of spreading.

      As apparent in this figure and also indicated in the quantification in Supplementary Fig. 9a, spreading of the Sir complex is in most cases quite limited. The example in Supplementary Fig. 9b is one of the largest spreads we observed. The scale of the spreading is hard to meaningfully visualize in Supplementary Fig. 8 given the relatively large genomic distances shown in these profiles. We therefore refer the reader to the analyses shown in Supplementary Fig. 9a, which shows chromosome-resolved extent of spreading.

      Reviewer #2 (Recommendations for the authors):

      To go beyond the correlation between the presence of Y' elements and low levels of protein axis binding, subtelomeres could be easily truncated. Analyzing strains with different distributions of Y' elements would also be informative. The correlative analysis could also be expanded to compare how far the influence of Y' elements goes and whether the number of Y' impacts the extent of protein axis depletion.

      We respectfully disagree with the assertion that subtelomeres could easily be truncated. The high repetitiveness of these sequences makes targeted manipulations of the extreme ends where the Y’ elements are located essentially impossible and is the main reason for the limitations associated with the analysis of the chromosome fusions as outlined in the response to reviewer 1.

      However, we would like to thank the reviewer for their suggestion to analyze different strain backgrounds. We now compared axis protein enrichment between S288c and SK1. For this analysis, we took advantage of a dataset we had produced previously that measures Red1 enrichment in SK1/S288c hybrid strains, which provide a powerful internally controlled setup that eliminates effects caused by differential timing and synchrony between samples. As now shown in Supplementary Fig. 5, SK1 and S288c differ substantially in their subtelomeric architecture at many ends, including extensive differences in the number and distribution of Y’ elements. Importantly, axis protein distribution was very consistent between SK1 and S288c when correcting for the differences in length of individual chromosome ends, supporting the conclusion that axis protein enrichment levels are primarily encoded in cis. This analysis is now shown in Fig. 2c. These data also indicate that the presence of Y’ elements does not affect axis protein levels beyond displacing the axis-recruiting sequences further into the chromosome interior.

      Given the separation between protein axis loading and DSB induction, it would be interesting to test whether the presence of Y' elements influences the frequency and position of DSB induction.

      We agree that this experiment would be very interesting. However, given the experimental challenges associated with targeted manipulation of Y’ elements as outlined above, we believe that this experiment lies outside the scope of this study. Our observations that Y’ elements do not grossly influence axis protein enrichment in their vicinity may also make an effect on DSB formation less likely.

      The effect of Dot1 on Red1 loading is intriguing because it is at least partially independent of its main known target H3K79, yet fully dependent on Sir3. However, this effect extends far beyond Sir3 binding as detected by ChIP. This is surprising because Dot1 has a limited effect on Sir3 binding as detected by ChIP, and SIR3 deletion has no impact on Red1 binding. However, Dot1 was shown to limit Sir3 spreading to 20 kb on average when overexpressed (Katan-Khaykovich and Struhl, 2005; Hocher et al., 2018). It would be interesting to test whether the regions affected by DOT1 deletion coincide with the zone covered by Sir3 upon overexpression (Extended Silent Domains: ESDs, Hocher et al., 2018).

      We agree that this would be an interesting analysis. Unfortunately, the available data on the extended silent domains were not obtained in SK1 and, as noted above, the chromosome end structure differs substantially between the strains, preventing direct comparisons without repeating all the relevant analyses in S288c. In addition, the available data was collected in vegetative cells, although this may be less of an issue given that our analyses show similar spreading in vegetative and meiotic cells. However, short of repeating SIR3 overexpression in meiosis (which also would require a different overexpression regimen as galactose interferes with meiosis), we are not in a position to do this analysis.

      As mentioned in the manuscript, the interplay between the Sir complex and Dot1 has been shown to affect checkpoint regulation during meiotic recombination. However, a discussion on how this relates to the observations reported here is missing.

      Thank you. We included a discussion of this role and its relation to our observations.

      Also, it is unclear why the authors did not investigate the impact of Dot1 and Sir3 on the binding of Hop1 rather than Red1, given that Hop1 is currently « the most upstream regulator of recombination known to be depleted about 20 kb from chromosome ends. »

      We changed this statement in the introduction to avoid confusion and also included a model figure that specifically highlights the Rec8-dependent recruitment as a regulatory target.

      Our data show that most of the telomere-proximal effects seem to act through the Rec8-dependent recruitment pathway for which Red1 is the most upstream regulator known. So, although the most upstream factor known before this study was Hop1, our data now identify the interaction between Red1 and Rec8 as the most upstream regulatory node.

      Sir3's impact on DSB induction is compelling, yet it only accounts for a small proportion of DSB depletion in subtelomeric regions. Thus, the main mechanisms suppressing crossover close to the ends of chromosomes remain to be deciphered. This should be acknowledged and discussed.

      In addition to the explicit statement of this conclusion in the results, we now added another statement in the abstract and also expanded the discussion of the fact that there are clearly additional levels of regulation that remain to be discovered.

      Reviewer #3 (Recommendations for the authors):

      Major points:

      It would be nice to show a schematic summary of the authors' main conclusion.

      Thank you, we now included a model schematic as Fig. 9.

      Minor points:

      (1) Supplemental Figure 2: A small box for the X element is marked with the same color as the Y' element, and so it is very hard to find the X element. Please use the clearer color, and it would be nice to show the chromosome ends without the X element (lines 129-130).

      We now included arrowheads at locations with full-length X elements. Partial X elements are marked with stars. This notation also makes it obvious which ends lack annotated X elements.

      (2) Line 156-163, Figure 2b: In the main text, "chromosome fusion between chromosome IV right arm and chromosome I left arm" should be mentioned. Moreover, it isn't very clear why the data are shown in the S288C background. The fusion points are different between S288C and SK1 (the structures of these ends are quite different). Please explain the authors' logic in the text.

      To improve clarity, we have included a schematic to better highlight which chromosomal sequences were removed. We have also substantially expanded the description of this experiment and explicitly state the limitations of this assay.

      (3) Supplemental Figure 6: Since the sir3 mutation affects the binding of Red1 EARs (and centromeres), it would be nice to show similar sets for the HML, MAT, and HMR loci (and intergenic regions as a control).

      We are unfortunately statistically underpowered to perform a meta-analysis of just HML, HMR and MAT. However, we now indicated the positions of HML and HMR in Supplementary Fig. 2 and 8, so the binding of the axis proteins and Sir3 can be inspected directly. MAT is not within 50 kb of a chromosome end and thus was not captured in these analyses.

      (4) Line 322-, the section: From here, the authors switched their story from the sir3 to the sir2. It would be nice to provide the logic with a small introduction on the relationship between Sir2 and Sir3.

      We apologize for this confusion. We are not switching our story to Sir2 but rather are taking advantage of an available dataset that analyzed DSBs in sir2 mutants. We then return to Sir3 to also analyze DSBs in the sir3 mutant and analyze its interaction with a dot1 mutation. To better support the logic, we now briefly reiterate that Sir2 and Sir3 are part of the same complex at the beginning of this section.

      (5) Line 330-331, Figure 8a (and also Supplemental Fig. 8c): Would you explain a bit more about the matched strains in the text or figure legend? Does each dot represent a strain? If so, please show the strains used here.

      Each dot refers to an individual X or Y’ element that is shown matched in WT and mutant to highlight the trends at the level of individual elements. This is noted in the figure legend.

      (6) Supplemental Figure 7 (and 2): It would be nice to show the position of the HML, MAT, and HMR loci as well as the centromeres in the Figure.

      We now indicated the positions of HML and HMR in Supplementary Fig. 2 and 8. MAT and the centromeres are not located within 50 kb from chromosome ends.

    1. eLife Assessment

      This valuable study demonstrates that self-motion strongly affects neural responses to visual stimuli, comparing humans moving through a virtual environment to passive viewing. The evidence for visuomotor mismatch responses is solid, although the interpretation in terms of prediction remains somewhat preliminary. This study bridges human and rodent studies on the role of prediction in sensory processing, and is therefore expected to be of interest to a large community of neuroscientists.

    2. Reviewer #1 (Public review):

      In this paper, Solyga, Zelechowski & Keller study human visuomotor mismatch responses as an alternative instantiation of prediction errors to classic oddball paradigms. Using VR, they created a condition in which participants were moving around thereby creating a visuomotor coupling between physical movement and visual flow. To attempt to isolate the contribution of specifically movement-related predictions in this condition, they contrasted it to a condition in which participants were seated and rewatching their movement trajectory during the 'active' condition. Visuomotor mismatches were created by temporarily decoupling movement and visual experience by halting the VR display as participants continued to move.

      The core finding of the paper is that participants exhibit a positively-valenced response to the visuomotor decoupling in the active but not in the passive condition. Since walking speed slows only insignificantly following decoupling events in the active condition, the authors argue that this difference cannot be accounted for by "changes in participants' behavior or to simple visual offset responses" with the latter being equal across both conditions. The response to the subsequent reinstatement of the coupling in turn does not differ between the two conditions. The authors additionally show that this mismatch response differs from visual onset responses elicited by checkerboard inversions and that it's "qualitatively" stronger than more commonly studied auditory oddball mismatch responses.

      The design with its focus on ecological validity is impressive, well-rationalized and the results are well illustrated. I additionally appreciate the control analyses with regards to changes in walking speed and playback DOF and, now added, additional participants who experience the passive condition before the active. I have a couple of questions/comments.

      My main question in round 1 regarded the isolation of visuomotor mismatch. Although the comparison with a seated control seems like a very sensible way to control for simple visual responses, there seem to be more differences than just a break in visuomotor coupling between the conditions. I therefore wonder whether the reduced offset response in the seated condition may be, in part, explained differently. For example, given that participants always conduct the active condition before rewatching their movement in the seated condition, it seemed likely that there is a component of learning across the session that flow will sometimes be halted. This is confirmed with the analyses. The explanation that there is a visuomotor component here is given further weight by their testing of an additional group of participants who perform the conditions in the reverse order, so this has strengthened the manuscript considerably. However, it does of course remain an imperfect control because the visual stimulus is now different between the conditions for these participants. It's the best that can be achieved with this type of paradigm though and of course it yields a great deal of ecological validity.

      I was also wondering whether the authors may consider the findings in frontal electrodes more closely given that the title of the paper focuses on a specifically occipital effect. Their further analyses have confirmed that there are likely interesting frontal effects. From a theoretical point of view, the spatial dissociation in adaptation effects, which were stronger in frontal and weaker in occipital areas, seems interesting and perhaps worth discussing, especially given the interpretation that "mismatch processing may initially arise in sensory visual areas before engaging higher-order frontal regions." How come the frontal decrease in responses is not accompanied by an analogous decrease in its supposed occipital source? Could these two responses reflect different kinds of prediction error signals (i.e. objective vs subjective)?

      I remain concerned that the authors fight too defensively that they have absolutely isolated visuomotor prediction mechanisms with this paradigm. It's a nice, informative study, but it seems odd to argue there are no other possible explanations. One picks a design to optimize some features but they will always come at some cost to others. Prioritising ecological validity, which is a justifiable aim, necessarily usually weakens some control over confounds.

      To outline my reasoning fully: My concerns with regard to generic influences of action on perception are reflected in Fig 1. The P1 is smaller when walking than sitting. It seems likely that the mismatch response reflects something about extrapolation or prediction, because it is larger when walking. However, it's not necessarily sensorimotor prediction. Even if you remove action from the equation, the flow can be extrapolated or predicted most of the time in a way it cannot so well when the video is halted. Of course the sitting condition somewhat controls for it, but when it came second the visual flow disruptions were more predictable here. A reduction in effects over time is indeed confirmed with their analyses. They now have conducted a study with the conditions in the reverse order and they find the same thing. But of course this necessitates non-identical visual flow because the sitting condition is playing the previous participant's flow. So it is likely that across all of these comparisons, it is the visuomotor mismatch that is especially salient. It's just that each comparison is a bit messy/confounded. It would strengthen the manuscript if there were some consideration given to the other processes likely at play here.

      As a more minor point in response to our previous review, whether particular accounts represent an 'orthodox' view at present does not determine whether they raise logical issues in need of consideration. The authors may have missed that the papers in question consider mechanisms underlying the attenuation of particular pieces of information *from perception*. Not perceptual processing. We have one percept at any one moment in time and must understand how different population types synergistically generate that percept.

      Similarly a little strange is the way in which the authors aggressively defend the position that self-generated motion is 'the strongest' type of prediction. Sure, we probably experience the effects of our actions more often than ambulances. But what about objects obeying laws of gravity or others' faces being structured and moving in systematic ways? It is hard to quantify, such that presumably many scientists would be skeptical of such a claim, and it is not needed logically to justify the importance of examining mechanisms enabling action to shape perceptual processing. I'd assume it better to fight the battles you need to (and can) fight, such that the robust claims carry more weight.

      Hope these comments are helpful.

    3. Reviewer #2 (Public review):

      Summary:

      This study investigates whether visuomotor mismatch responses can be detected in humans. By adapting paradigms from rodent studies, the authors report EEG evidence of mismatch responses during visuomotor conditions and compare them to visual-only stimulation and mismatch responses in other modalities.

      Strengths:

      - Authors use a creative experimental design to elicit visuomotor mismatch responses in humans.

      - The study provides an initial dataset and analytical framework that could support future research on human visuomotor prediction errors.

      Weaknesses:

      - Methodological issues (e.g., volume conduction) make it difficult to confidently attribute the observed mismatch responses to activity in visual cortical regions. This could be alleviated by increasing the number of channels.

      The authors successfully demonstrate that visuomotor mismatch paradigms can, in principle, be applied in human EEG. This approach provides a translational bridge between rodent and human work on predictive processing.

    4. Reviewer #3 (Public review):

      Solyga, Zelechowski, and Keller present a concise report of an innovative study demonstrating clear visuomotor mismatch responses in ambulating humans, using a mobile EEG setup and virtual reality. Human subjects walked around a virtual corridor while EEGs were recorded. Occasionally, motion and visual flow were uncoupled, and this evoked a mismatch response that was strongest in occipitally placed electrodes and had a considerable signal to noise ratio. It was robust across participants and could not be explained by the visual stimulus alone.

      This is an important extension of their prior work in mice, and represents an elegant translation of those previous findings to humans, where future work can inform theories of e.g. psychiatric diseases that are believed to involve disordered predictive processing. For the most part, the authors are appropriately circumspect in their interpretations and discussions of the implications. The paper in its current form represents an important addition to the literature.

      The authors have included analyses of the auditory mismatch using temporal electrodes, referenced to Cz (and therefore should exhibit a mismatch positivity). This added data clearly and convincingly shows that the sensorimotor mismatch is, indeed, stronger than the passive auditory MMN.

      - The reference electrode placed at Cz makes it difficult to interpret relative differences between frontal and occipital electrode responses, as the occipital electrodes are placed farther away from the Cz reference than the frontal electrodes. Similarly, signal occurring cortically near the Cz reference might only appear as though it is occipitally distributed in this montage. It is common in EEG research to re-montage the data to an averaged common reference in order to better interpret the scalp distributions. As the electrode coverage was sparse for some subjects, this could be challenging, and this reviewer does not feel that it is necessary to do this analysis step, or even to drastically rewrite the body of the paper. We only request that some discussion, however brief, is included in the discussion section or the methods that recommends denser electrode coverage in the future to better interpret scalp distributions and potential meso-scale sources.

      - This is just a suggestion. The authors are encouraged to analyse (and report) time-frequency power and phase locking for these mismatch responses, as is common in much of the literature (see Roach et al 2008 Schizophrenia Bulletin). This is not to say that doing so will yield insights into oscillations per se, but converting the data to the time-frequency domain provides another perspective that has some advantages. It fosters translation to rodent models, as ERP peaks do not map well between species, but e.g. delta-theta power does (see Lee et al 2018 Neuropsychopharmacology; Javitt et al 2018 Schizophrenia Research; Gallimore et al 2023 Cereb Ctx). Further, ERP peaks can be influenced by the actual neuroanatomy of an individual (especially for quantifying V1 responses). Time-frequency analyses may aid in interpreting the "early negative deflection with a peak latency of 48 ms" finding as well. As it stands, the report is complete, and it would be acceptable if the authors chose to save this type of analysis for a future publication.
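
      To make the re-montaging suggestion above concrete, the following is a minimal sketch of re-referencing to a common average. It assumes the recording is held as a plain dictionary of channel name to samples, all measured against Cz; the function name `rereference_to_average` and the example values are hypothetical and not taken from the manuscript. With sparse coverage the average is a poor estimate of a neutral reference, which is the caveat raised above.

```python
def rereference_to_average(data, ref_name="Cz"):
    """Re-reference channels recorded against `ref_name` to the common average.

    data: dict mapping channel name -> list of samples (same length per channel).
    Illustrative helper only; not the authors' analysis code.
    """
    n_samples = len(next(iter(data.values())))
    full = dict(data)
    # The original reference is implicitly zero at every sample. Adding it back
    # lets it contribute to the average and be recovered as a regular channel.
    full.setdefault(ref_name, [0.0] * n_samples)
    channels = list(full)
    rereferenced = {ch: [] for ch in channels}
    for t in range(n_samples):
        mean_t = sum(full[ch][t] for ch in channels) / len(channels)
        for ch in channels:
            rereferenced[ch].append(full[ch][t] - mean_t)
    return rereferenced


# Hypothetical two-sample recording with one occipital and one frontal channel.
example = {"O1": [4.0, 2.0], "Fp1": [1.0, -1.0]}
avg_ref = rereference_to_average(example)
```

      After re-referencing, the channels sum to zero at every sample, while differences between any two channels are unchanged; only the apparent scalp distribution shifts.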

    5. Author response:

      The following is the authors’ response to the original reviews.

      We thank you for the time you took to review our work and for your feedback! The main changes to the manuscript are:

      (1) We have performed additional experiments to increase the number of recordings from frontal and occipital electrodes (previously 51 (occipital: O1+O2) and 26 (frontal: Fp1+Fp2), now 133 and 102). The additional data have strengthened many of our results, including for example the trend for a latency difference between occipital and frontal electrodes that was likely underpowered and is now significant (Figure 3E). We have updated all relevant figures to include the additional data (Figures 2–6, Figure S4, Figure S5). None of the main conclusions have changed.

      (2) As suggested by reviewer 1, we have conducted additional experiments to rule out the possibility that the observed effects were driven by the temporal order of open and closed loop sessions (new Figure S6). We also found another 9 participants who were willing to go on the ‘vomit comet’ of six degrees of freedom (6DOF) playback (previously 5, now 14). These data have further strengthened our conclusion that playback halt responses in 4DOF and 6DOF playback are not substantially different (Figure S4).

      (3) To address the point of reviewers 2 and 3, that mismatch negativity (MMN) responses would be larger on temporal electrodes, we conducted additional experiments in which we also recorded from temporal electrodes T3–T6. We have now added a comparison of visuomotor mismatch and MMN responses on T3–T6 electrodes as Figures S8–S9. On all electrodes, visuomotor mismatch responses were larger than MMN responses.

      (4) As suggested by reviewer 1, we have added an analysis of the experience-dependent changes in mismatch responses comparing frontal and occipital responses early and late in the session (new Figure 4).

      (5) As suggested by reviewer 2, we conducted additional experiments in an independent cohort of participants (note, without concurrent EEG) to measure eye movements triggered by visuomotor mismatches. We found eye-movement speed and blink/eye-closure changes, but these had longer latency than visuomotor mismatch responses (Figure S7).

      (6) Finally, as suggested by reviewers 2 and 3, we applied independent component analysis (ICA) and time–frequency analyses to the EEG data. We show these results and explain why they are not applicable or useful in our case in the responses below.

      Please note, during the revision, we found that a part of our analysis used a bandpass of 0.2-100 Hz while a 1-100 Hz bandpass filter was used elsewhere. This has now been standardized to a 1-100 Hz bandpass filter, and the corresponding methods were updated. This resulted in no relevant changes to the figures. Additionally, the 50 Hz band-stop filter was erroneously described in the methods as 49-51 Hz. The filter used was 40-60 Hz, and the methods have been updated to reflect this.

      Reviewer #1 (Public review):

      In this paper, the authors wished to determine human visuomotor mismatch responses in EEG in a VR setting. Participants were required to walk around a virtual corridor, where a mismatch was created by halting the display for 0.5s. This occurred every 10-15 seconds. They observe an occipital mismatch signal at 180 ms. They determine the specificity of this signal to visuomotor mismatch by subsequently playing back the same recording passively. They also show qualitatively that the mismatch response is larger than one generated in a standard auditory oddball paradigm. They conclude that humans therefore exhibit visuomotor mismatch responses like mice, and that this may provide an especially powerful paradigm for studying prediction error more generally.

      Asking about the role of visuomotor prediction in sensory processing is of fundamental importance to understanding perception and action control, but I wasn't entirely sure what to conclude from the present paradigm or findings. Visuomotor prediction did not appear to have been functionally isolated. I hope the comments below are helpful.

      (1) First, isolating visuomotor prediction by contrasting against a condition where the same video stream is played back subsequently does not seem to isolate visuomotor prediction. This condition always comes second, and therefore, predictability (rather than specifically visuomotor predictability) differs. Participants can learn to expect these screen freezes every 10-15 s, even precisely where they are in the session, and this will reduce the prediction error across time. Therefore, the smaller response in the passive condition may be partly explained by such learning. It's impossible to fully remove this confound, because the authors currently play back the visual specifics from the visuomotor condition, but given that the visuomotor correspondences are otherwise pretty stable, they could have an additional control condition where someone else's visual trace is played back instead of their own, and order counterbalanced. Learning that the freezes occur every 10-15 s, or even precisely where they occur, therefore, could not explain condition differences. At a minimum, it would be nice to see the traces for the first and second half of each session to see the extent to which the mismatch response gets smaller. This won't control for learning about the specific separations of the freezes, but it's a step up from the current information.

      In theory, it is correct that the open loop (playback) session is predictable. However, this is relatively unrealistic. The open loop session is a 5-minute sequence that participants have only experienced once before, when they were generating it in the closed loop session a couple of minutes earlier. It is unlikely that participants would remember the entire sequence to a precision of less than a second, which is what they would need to predict the mismatch event. However, the reviewer is correct that it is possible that the mismatch events lose salience with time, for example as a consequence of participants losing interest in the task with time, or by undergoing some form of adaptation. To address this, we repeated the experiments with the sequence of closed and open loop sessions reversed (Figures S6A-S6C), and we analyzed the responses as a function of time within the session (Figures S6D and S6E), as suggested.

      The reversed-order design consisted of (1) open loop session: a playback, in which participants viewed the recorded closed loop session of a previous participant. This was followed by (2) a closed loop session, in which participants actively walked through the tunnel and experienced visuomotor mismatch events. Using this design, we again found that responses in the closed loop session were significantly larger than in the open loop session (Figures S6A-S6C).

      In addition, we analyzed both new and previously collected data as a function of time in the session. We computed moving average responses across 10 mismatch or playback halt trials at different percentages of progress through the paradigm (Figures S6D and S6E). This analysis revealed no consistent experience-dependent changes that could account for the observed differences between closed and open loop sessions. While there was indeed some form of experience-dependent attenuation of visuomotor mismatch responses (see new Figure 4), the difference at the transition from mismatch to playback halt (and vice versa) far exceeded these adaptation effects (Figures S6D and S6E). This analysis was performed only on data from participants for whom we had both closed and open loop sessions and who met our inclusion criteria.
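
      As an illustration of this kind of analysis, a minimal sketch of a 10-trial moving average indexed by session progress is given below. It assumes one response amplitude per trial in session order; the function name `moving_average_by_progress` and the choice to place each window at its centre trial are assumptions made for the example, not a description of the exact procedure used for Figures S6D and S6E.

```python
def moving_average_by_progress(trial_amplitudes, window=10):
    """Return (progress_percent, mean_amplitude) for each window of trials.

    trial_amplitudes: per-trial response amplitudes in session order.
    Progress is 0 % at the first trial and 100 % at the last trial.
    """
    n = len(trial_amplitudes)
    if n < window:
        return []
    points = []
    for start in range(n - window + 1):
        chunk = trial_amplitudes[start:start + window]
        center = start + (window - 1) / 2  # index of the window centre
        progress = 100.0 * center / (n - 1) if n > 1 else 0.0
        points.append((progress, sum(chunk) / window))
    return points


# Hypothetical amplitudes that decline steadily over 20 trials.
amplitudes = [float(20 - i) for i in range(20)]
curve = moving_average_by_progress(amplitudes)
```

      Plotting such curves for the closed and open loop sessions on a common progress axis makes it possible to compare a gradual within-session drift against the step at the session transition.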

      We used a similar analysis to test whether early and late responses within a session systematically differed (new Figure 4). Here, to maximize the chance of finding a difference, we compared early (first five) and late (last five) trials. Behaviorally, participants reduced their walking speed following mismatch events, with a significantly larger reduction during early trials (14.3%) than during late trials (5.7%) (Figure 4A). Neural responses mirrored this pattern primarily on frontal electrodes: frontal activity showed a clear attenuation from early to late trials (Figure 4B), consistent with the reduction in behavioral responses. In contrast, changes on occipital electrodes were much smaller between early and late trials (Figure 4C-4D). Thus, experience-related modulation is substantially stronger in frontal compared to occipital regions.
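
      The early versus late comparison can be sketched as follows. This is a minimal example that assumes each trial is summarised by a walking speed before and after the mismatch event; the function names and the toy values are hypothetical and are not the data reported in Figure 4A.

```python
def percent_reduction(pre_speed, post_speed):
    """Relative slowing after an event, as a percentage of the pre-event speed."""
    return 100.0 * (pre_speed - post_speed) / pre_speed


def early_late_reduction(trials, n=5):
    """Mean percent reduction for the first n and the last n trials.

    trials: list of (pre_speed, post_speed) tuples in session order.
    """
    if len(trials) < 2 * n:
        raise ValueError("need at least 2 * n trials to avoid overlap")
    early = [percent_reduction(pre, post) for pre, post in trials[:n]]
    late = [percent_reduction(pre, post) for pre, post in trials[-n:]]
    return sum(early) / n, sum(late) / n


# Hypothetical session: strong slowing early, weaker slowing late.
toy_trials = [(1.0, 0.8)] * 5 + [(1.0, 1.0)] * 3 + [(1.0, 0.9)] * 5
early_mean, late_mean = early_late_reduction(toy_trials)
```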

      In sum, we do not believe that the difference between visuomotor mismatch responses and playback halt responses can be explained by differences in the predictability of mismatch and playback halt events.

      (2) Second, the authors admirably modified their visual-only condition to remove nausea from 6 df of movement (3D position, pitch, yaw, and roll). However, despite the fact it's far from ideal to have nauseous participants, it would appear from the figures that these modifications may have changed the responses (despite some pairwise lack of significance with small N). Specifically, the trace in S3 (6DOF) and 2E look similar - i.e., comparing the visuomotor condition to the visual condition that matches. Mismatch at 4/5 microvolts in both. Do these significantly differ from each other?

      Yes, the 6DOF playback halt response shown in the previous Figure S3 and the mismatch response shown in previous Figure 2E are significantly different (Author response image 1).

      Author response image 1.

      Comparison of visuomotor mismatch response (A) and 6DOF playback halt response (B) from the original submission with statistics of the comparison (C).

      Nevertheless, to strengthen this conclusion, we collected additional data in the 6DOF condition. We show the comparison for participants for whom both closed loop (active) and open loop sessions (6DOF) were recorded within the same recording session (14 participants) in Figure S4. Consistent with our previous findings, visuomotor mismatch responses were significantly larger than 6DOF playback halt responses (Figures S4A-S4C). And we found no evidence of a difference between 6DOF and 4DOF playback halt responses (Figures S4D and S4E).

      (3) It generally seems that if the authors wish to suggest that this paradigm can be used to study prediction error responses, they need to have controlled for the actions performed and the visual events. This logic is outlined in Press, Thomas, and Yon (2023), Neurosci Biobehav Rev, and Press, Kok, and Yon (2020) Trends Cogn Sci ('learning to perceive and perceiving to learn'). For example, always requiring Ps to walk and always concurrently playing similar visual events, but modifying the extent to which the visual events can be anticipated based on action. Otherwise, it seems more accurately described as a paradigm to study the influence of action on perception, which will be generated by a number of intertwined underlying mechanisms.

      We are not entirely sure we understand the point here correctly. If the reviewer is suggesting that visuomotor coupling is not describable by the ideas of predictive processing, we disagree. However, given that the papers the reviewer is pointing to are premised on what seems to be a somewhat unorthodox interpretation of predictive processing when it comes to cortical circuits, we suspect this is contributing to the misunderstanding here. Let us briefly explain. In the two papers, Press and colleagues argue that most experiments cannot distinguish between “predictive cancellation” and “gated suppression”. This is indeed relatively tricky, even when one has single neuron data. The question is, does movement simply suppress sensory feedback (as is likely the case e.g. in the famous example of the cricket), or does movement result in a precise removal of only the self-generated sensory reafference? The first good evidence of the latter happening in any system is quite recent (Keller and Hahnloser, 2009). The premise the authors build their argument on is that the theory posits that “the brain predictively ‘cancels’ expected action outcomes from perception” (from the abstract of one of the papers). This is incomplete. The minimum circuit for predictive processing is composed of 3 neuron types: positive prediction error neurons, negative prediction error neurons, and internal representation neurons. Only the positive prediction error neurons have the predictive cancellation property the authors discuss. This is not the case for either negative prediction error neurons, or for the internal representation neurons. Negative prediction error neurons are excited by predictions and suppressed by sensory input (i.e. if anything, they are “predictively amplified”). This circuit is relatively well characterized in mouse cortex – for a brief summary see (Keller and Mrsic-Flogel, 2018). 
Note, this is not our idea of course – the original formulation of predictive processing (Rao and Ballard, 1999) was built to explain end-stopping. These are responses to the absence of an expected line that were stronger than would be expected from classical theories (i.e. negative prediction error responses). In mouse visual cortex, we know that a sudden break in the coupling between locomotion and visual flow selectively activates layer 2/3 negative prediction error neurons. Thus, if human cortex also implements a predictive processing like circuit with positive and negative prediction error neurons, we would expect a break in visuomotor coupling to drive a measurable response in visual cortex (by exciting the population of negative prediction error neurons – this is also why we are quite excited by the phase reversal of visual and mismatch responses as this could indicate that mismatch activates negative prediction error neurons first and positive prediction error neurons later, and vice versa for visual stimulation – negative prediction error neurons are more superficial in cortex (O’Toole et al., 2023)). We do indeed find a response over occipital cortex consistent with the negative prediction error response we observe in mouse cortex. The difficulty in distinguishing “predictive cancellation” and “movement driven suppression” comes only when looking at positive prediction error type responses (that are suppressed by predictive inputs) but does not apply to negative prediction error responses. The predictive processing circuit we are testing is the one described by (Keller and Mrsic-Flogel, 2018; Rao and Ballard, 1999), and here the break in visuomotor coupling is a stimulus that drives negative prediction error responses. Note, other authors who have thought about cortical implementations of predictive processing (e.g. (Bastos et al., 2012)) have glossed over the problem that individual neurons cannot trivially encode both positive and negative errors. 
Prediction errors are a signed quantity. If neurons signal prediction errors in firing rates and are close to zero firing rate at baseline (as is the case in layer 2/3 of cortex), they cannot (short of rather exotic ideas) encode a signed prediction error. Hence such proposals are not very useful for thinking about prediction error responses in cortex. For these reasons, we see no problem with referring to the response as a prediction error response. This is in line with a large body of mouse research (using a nearly identical paradigm) on the topic.
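
      The point about signed errors can be illustrated with a toy calculation. This is a minimal sketch assuming simple rectified-linear units with zero baseline rate; the function name `prediction_error_rates` and the unit-free values are hypothetical, and the sketch is not a model of the recorded EEG signal.

```python
def prediction_error_rates(prediction, sensory_input):
    """Firing rates of a rectified positive and negative prediction error unit.

    A rate cannot drop below zero, so a single unit with zero baseline can
    only report one sign of the error; two opposing units cover both signs.
    """
    positive_pe = max(0.0, sensory_input - prediction)  # more input than predicted
    negative_pe = max(0.0, prediction - sensory_input)  # less input than predicted
    return positive_pe, negative_pe


# Visuomotor mismatch: locomotion predicts visual flow, but the flow is halted.
mismatch = prediction_error_rates(prediction=1.0, sensory_input=0.0)
# Unexpected visual input while no flow is predicted.
unexpected = prediction_error_rates(prediction=0.0, sensory_input=1.0)
# Matched prediction and input: neither unit responds.
matched = prediction_error_rates(prediction=1.0, sensory_input=1.0)

# The signed error is recovered as the difference of the two rates.
signed_mismatch = mismatch[0] - mismatch[1]
```

      In this toy circuit a halt of visual flow during locomotion drives only the negative prediction error unit, which is the response type the mismatch paradigm is designed to elicit.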

      One could of course argue that gated suppression could also mean that movement relieves suppression. Thus, one could assume that some neurons are suppressed by movement while others are enhanced. If one allows for enough neuron and stimulus specificity in the precision of the movement related suppression and enhancement of responses, the two models (predictive processing and gated suppression) become equivalent, and the discussion becomes semantic. See (Vasilevskaya et al., 2023) for an extended discussion on this point, and the reasons why we think predictive processing is a more useful model than gated suppression (keep in mind, gated suppression only explains the data if we allow for stimulus/neuron specific gain factors of the suppression, in which case the two models are equivalent).

      More minor points:

      (1) I was also wondering whether the authors may consider the findings in frontal electrodes more closely. Within the statistical tests of the frontal electrodes against 0, as displayed in Figure 3c, the insignificance of the effect of Fp2 seems attributable to the small included sample size of just 13 participants for this electrode, as listed in Table S1, in combination with a single outlier skewing the result. The small sample size stands out especially in comparison to the sample size at occipital electrodes, which is double and therefore enjoys far more statistical power. It looks like the selected time window is not perfectly aligned for determining a frontal effect, and also the distribution in 3B looks like responses are absent in more central electrodes but present in occipital and frontal ones. I realise the focus of analysis is on visual processing, but there are likely to be researchers who find the frontal effect just as interesting.

      That is correct; our data in frontal electrodes were likely underpowered. The reason we have fewer data in frontal electrodes is that eye-blink artifacts are particularly strong in frontal channels, resulting in a larger proportion of trials failing to meet our data inclusion criteria. We have now added more data from frontal and occipital electrodes by including additional experimental sessions. In addition, we applied less stringent trial-exclusion criteria, requiring that no artifacts occur within the time window −0.5 to 1 s relative to the event trigger (instead of −0.5 to 2 s). This adjustment allowed us to retain a larger number of trials. As anticipated by the reviewer, this increase in data was sufficient to confirm a significant response to the visuomotor mismatch event at both frontal electrodes (Figure 3C). The expanded dataset also revealed a significant difference in response onset times between occipital and frontal electrodes (Figure 3E), an effect that was not significant previously. In addition, we have included an analysis comparing early and late mismatch responses in frontal and occipital electrodes (Figure 4).
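
      The effect of shortening the exclusion window can be sketched as follows. This is a minimal example assuming artifacts have already been detected and are available as a list of time stamps in seconds; the function name `keep_trial` and the example times are hypothetical, and artifact detection itself is not shown.

```python
def keep_trial(event_time, artifact_times, window=(-0.5, 1.0)):
    """Return True if no artifact falls inside the window around the event.

    event_time: trigger time in seconds.
    artifact_times: times (s) at which an artifact, e.g. an eye blink, was detected.
    window: (start, end) in seconds relative to the trigger.
    """
    start = event_time + window[0]
    end = event_time + window[1]
    return not any(start <= t <= end for t in artifact_times)


# Hypothetical trial: trigger at 10 s, blink detected 1.5 s later.
blinks = [11.5]
kept_with_short_window = keep_trial(10.0, blinks, window=(-0.5, 1.0))
kept_with_long_window = keep_trial(10.0, blinks, window=(-0.5, 2.0))
```

      A blink 1.5 s after the trigger rejects the trial under the −0.5 to 2 s criterion but not under the −0.5 to 1 s criterion, which is why the shorter window retains more trials on blink-prone frontal channels.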

      (2) It is claimed throughout the manuscript that the 'strongest predictor (of sensory input) - by consistency of coupling - is self-generated movement'. This claim is going to be hard to validate, and I wonder whether it might be received better by the community to be framed as an especially strong predictor rather than necessarily the strongest. If I hear an ambulance siren, this is an especially strong predictor of subsequent visual events. If I see a traffic light turn red, then yellow, I can be pretty certain what will happen next. Etc.

      This is a statistical argument. Every movement – throughout life – is directly and immediately coupled to sensory feedback and has been throughout evolutionary history. The vast majority of visual input you receive (we estimate, well above 99%) is the consequence of your own movements (e.g. every few 100 ms your eye movements cause a full field change in your visual input). The same is likely true of proprioceptive and somatosensory input – the vast majority is the direct consequence of your own movements (not other people poking you). This is likely different in the auditory system where a much larger fraction of the input is externally driven (depending a bit on how much one likes to talk). But even here the best predictor is self-motion (most non-self-generated sounds one experiences in life are very difficult to predict with millisecond precision). The example the reviewer gives is a good illustration of this. Take the siren that hails the appearance of an ambulance. The siren tells us that an ambulance will appear, but not how it will look, not when exactly it will appear, and with only very low resolution as to where it will appear. Incidentally, if you ask people to draw an ambulance they tend to draw a WWII style white square vehicle with a red cross on the side – a style of ambulance they likely have not ever seen in life. Their visual predictions of what they are about to see are very low resolution. We catastrophically fail at making pixel perfect predictions from learned stimulus associations of this nature. The traffic light example is difficult to compare to visual feedback control of movement as it is a much simpler prediction of a single bit in the form of a change in color of an existing object.

      In addition, consider how often (in life) you have seen an ambulance after hearing it. 100 times maybe? Maybe less. How often have you seen traffic lights change - 10 000 times? 100 000 times? Now consider how often you have experienced the visual consequences of moving your head or eyes to the left (keep in mind this includes micro saccades) – at a conservative estimate of once per second, that is somewhere on the order of 1 000 000 000. This is not even in the same ballpark. Our brains can certainly learn to make the ambulance and traffic light type predictions - to some extent - but by far the best predictor of sensory feedback (simply by virtue of the physics of how our body interacts with the world) is self-motion.
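
      The order-of-magnitude estimate can be checked with a short calculation. This is a rough sketch: the age of 30 years and the waking fraction of two thirds are assumptions chosen for the example and are not stated in the text, and the function name `lifetime_events` is hypothetical.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # about 3.16e7 seconds


def lifetime_events(rate_per_second, years, waking_fraction=1.0):
    """Approximate number of events experienced at a constant rate while awake."""
    return rate_per_second * years * SECONDS_PER_YEAR * waking_fraction


# Assumed: one self-generated visual change per second, 30 years, awake 2/3 of the time.
self_motion_events = lifetime_events(1.0, years=30, waking_fraction=2.0 / 3.0)
# Upper figure for traffic light changes taken from the text above.
traffic_light_events = 100_000
ratio = self_motion_events / traffic_light_events
```

      Under these assumptions the count is several hundred million, within a factor of two of 10^9, and more than three orders of magnitude above the largest traffic light figure.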

      We think this is an argument we can make based on first principles, and one that is frequently overlooked in the field, as experiments often focus on training people or animals to learn novel associations that, especially in the case of mice, we often have no idea whether cortical circuits can even learn. We should focus experiments on the predictive systems our brains have evolved since long before the evolutionary appearance of ambulances and traffic lights. We understand that the reviewer may disagree with this, but unless the reviewer has a concrete example of an even stronger predictor (as measured by frequency of experience, consistency in coupling, and precision in timing – we can’t think of one), it is a point we will make.

      (3) The checkerboard inversion response at 48 ms is incredibly rapid. Can the authors comment more on what may drive this exceptionally fast response? It was my understanding that responses in this time window can only be isolated with human EEG by presenting spatially polarized events (cf. c1, e.g., Alilovic, Timmermans, Reteig, van Gaal, Slagter, 2019, Cerebral Cortex).

      We don’t know, but it is not inconsistent with previous reports. For example, compare the “standing” and “fast walking” target ERP responses in Figure 5 of (Gramann et al., 2010). Both here and in our data, the fast response peak is only really apparent in the direct comparison of visual responses recorded while participants were walking to those when they were stationary.

      While we have taken great care to calibrate the timing of the visual display with the EEG recording, one could be worried that the alignment is off by as much as tens of milliseconds. However, even if this were so, one could use P1 as a reference and determine that the fast peak roughly precedes P1 by about 40 ms. Which again would result in a latency of about 50 ms of the fast walking peak (assuming P1 peaks at about 90 ms). In sum, we have added a reference to the previous work (that we found thanks to the reviewer’s comment) but fear we have nothing intelligent to say beyond that.

      Reviewer #2 (Public review):

      Summary:

      This study investigates whether visuomotor mismatch responses can be detected in humans. By adapting paradigms from rodent studies, the authors report EEG evidence of mismatch responses during visuomotor conditions and compare them to visual-only stimulation and mismatch responses in other modalities.

      Strengths:

      (1) The authors use a creative experimental design to elicit visuomotor mismatch responses in humans.

      (2) The study provides an initial dataset and analytical framework that could support future research on human visuomotor prediction errors.

      Weaknesses:

      (1) Methodological issues (e.g., volume conduction, channel selection, lack of control for eye movements) make it difficult to confidently attribute the observed mismatch responses to activity in visual cortical regions.

      (2) A very large portion of the data was excluded due to motion artefacts, raising concerns about statistical power and representativeness. The criteria for trial inclusion and the number of accepted trials per participant appear arbitrary and not justified with reference to EEG reliability standards.

      (3) The comparison across sensory modalities (e.g., auditory vs. visual mismatch responses) is conceptually interesting, but due to the choice of analyzing auditory mismatch responses over occipital channels, it has limited interpretability.

      We have responded to these points in the more detailed itemization below.

      The authors successfully demonstrate that visuomotor mismatch paradigms can, in principle, be applied in human EEG. However, due to the issues outlined above, the current findings are relatively preliminary. If validated with improved methodology, this approach could significantly advance our understanding of predictive processing in the human visual system and provide a translational bridge between rodent and human work.

      Reviewer #2 (Recommendations for the authors):

      Overall, the study addresses an interesting and underexplored question (translation of the visuomotor mismatch responses observed in rodents to humans). Below, please find a list of specific suggestions for improvement.

      Introduction:

      (1) "updating internal representations and internal models" - what is the difference between the two, and why is it relevant to this study?

      In a nutshell, an internal model is the synaptic weight matrix that transforms between coding spaces. An internal representation is the activity pattern coding for the current representation. See (Aizenbud et al., 2025; Keller and Mrsic-Flogel, 2018) for more lengthy elaborations. The fact that the mechanism used for representation update can also be used to update internal models (i.e. solve the credit assignment problem) is likely the prime advantage of predictive processing (see work from the Bogacz lab). The relevance to the current study is justifying why predictive processing is a reasonable hypothesis for the function of cortex.

      (2) "Certain stimuli can be predicted from the preceding sensory input" vs. "Predictions can also be based on memory" - how are these two different? Do you mean specific (e.g., long-term associative or episodic) memory types in the latter?

      Correct, this is an arbitrary distinction that primarily makes sense in the light of experimental approaches. In this particular case, we were talking about spatial memory. We made this explicit to increase clarity.

      (3) "the strongest predictor - by consistency of coupling - is self-generated movement"

      (a) Externally induced movement, while not self-generated and therefore not predicted, will also generate sensory coupling, so is it really only about consistency?

      Externally induced movement (as in somebody else moving one’s arm; we are not sure this is what the reviewer means) will induce sensory-sensory coupling but not sensorimotor coupling. We might be misunderstanding the point. In case the reviewer means stimuli that trigger movement, as in us asking participants to walk or a sudden startle stimulus that makes them jump, then in all such cases there are of course sensorimotor predictions. Sensorimotor predictions are driven by efference copies of the motor command; thus all movements, whether ‘voluntarily’ executed or triggered by an external stimulus, will drive sensorimotor predictions. (All of this of course assumes that the predictive processing theory is correct.)

      (b) Do you mean temporal consistency (minimal lags), statistical contingencies (same movements linked to the same sensory inputs), or both? How does it differentiate sensorimotor/visuomotor mismatch responses from responses to incongruent stimuli in sensory modalities (e.g. audiovisual)?

      Both. We have rephrased the sentence to try to make this clearer. See also response to reviewer 1 minor point 2 above.

      How does it differentiate sensorimotor/visuomotor mismatch responses from responses to incongruent stimuli in sensory modalities (e.g. audiovisual)?

      Most cross-modal associations are much less consistent (the exact sound of a glass shattering is always slightly different and impossible for us to predict), and orders of magnitude less frequently experienced, than sensorimotor associations. Again, see also response to reviewer 1 minor point 2 above.

      (4) "Every movement is directly coupled to sensory feedback throughout life"

      This may be the case for proprioceptive and/or somatosensory feedback, but not necessarily for visual feedback (e.g., a mouse moving its tail), which is the topic of the study.

      Correct, there are movements that can be disconnected from visual feedback. Most of the time, however, most movements are not, and we are studying one of the more prominent ones that is clearly not decoupled: locomotion. The contrast we aim to highlight here very prominently is that there is still this vague idea in the field that you can take a participant, or a mouse, and expose them/it to a few tens or hundreds of trials of some sensory stimulus contingency and then probe for prediction error responses to a pattern only recently, if at all, learned. Given the life-long experience of subjects and mice, is it really surprising that oddball responses are less strong than a sensorimotor mismatch?

      (5) "However, the overall level of this motor-related activity is much higher than one would expect simply from predictions of visual feedback that are compared against visual input."

      Could you please clarify what one would expect in this case, and/or back it up with citations?

      This is in reference to the fact that there are very strong movement-related signals in the mouse visual cortex that persist even when the mouse is in complete darkness. In darkness, movements should not trigger any change in visual feedback; hence the activity is difficult to explain as a movement-related prediction of visual flow. We have rephrased this section of the introduction to make this clearer.

      (6) "The more precise the prediction and comparison, the less motor-related activity should be detectable in visual cortex."

      I think this conflates two issues. A good match between prediction and input would indeed result in sensory attenuation. However, sensory precision, at least in active inference, can upregulate prediction error responses. Since predictions cannot be assumed to be perfect (due to external or internal noise), increased precision may therefore augment activity. See e.g. https://doi.org/10.1007/s10339-013-0571-3

      We agree with the reviewer – the phrasing here was misleading. We do not mean precision in the predictive processing sense, but the precision of sensorimotor control necessary for the behavior. We have rephrased the corresponding section of the manuscript.

      (7) Neither the introduction nor the discussion refers to previous human EEG studies on sensorimotor mismatch responses, where sensory feedback doesn't match motor actions (e.g. https://doi.org/10.3758/s13423-021-01992-z ; https://www.sciencedirect.com/science/article/pii/S0028393214003777 ; https://www.sciencedirect.com/science/article/pii/S0028393219301265).

      The studies cited by the reviewer primarily test how discrete violations of learned action–outcome associations are represented in the brain, whereas our visuomotor mismatch paradigm probes violations of continuous sensorimotor coupling during ongoing action. The paradigms are conceptually different both in how strong the coupling is (lifelong vs. learned in the experiment), and in how prediction errors are likely used (visuomotor control vs. stimulus detection). We have added a brief part to our introduction discussing this.

      Results:

      (1) A very large proportion of the dataset was excluded due to movement artefacts. This is rather problematic as

      (a) the rationale behind finding mismatch responses is that motion-related (neural) signals should affect visual cortical activity, so it's essential to disentangle these neural signals from artefacts;

      Correct, we excluded 21.7% of the total data for the visuomotor mismatch paradigm. Note, this percentage is comparable to that of other similar studies of EEG recordings during movement (Oliveira et al., 2016). By “problematic”, we assume the reviewer means the fact that we have artefacts, not that we exclude trials with artefacts. The movement artefacts are typically caused by the acceleration during stepping in participants with a heavy gait. None of these movement artefacts are time-locked to any of the responses we investigate. Thus, they should just appear as increased levels of noise if not excluded. We don’t understand why the reviewer thinks this is particularly problematic for our analysis/conclusions (beyond the trivial consequence of increasing noise levels that would only cause us to underestimate the strength of the mismatch signals we report).

      (b) the criterion for the number of trials of 15 triggers (per condition?) is arbitrary and lower than widely used in the literature, so authors should demonstrate that this is a sufficient number to observe a measurable ERP even for those participants with 15 triggers;

      We have between 16 and 25 visuomotor mismatch events per participant. Author response image 2 is a selection of single participant examples with different numbers of trials. The number of mismatch events is limited by the fact that we introduce them approximately every 10 - 15 s and have a total duration of the closed loop session of 5 minutes. Thus, on average, we expect to have 24 mismatch events. But we are not sure we understand the logic of the comment: if we set the exclusion criterion too low, we just risk losing a response in the noise. And we clearly have stronger and higher signal to noise mismatch responses with an average of 20 trials compared to visual responses during movement with an average of 40 trials or MMN responses with an average of 28 trials.
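
      The expected event count quoted above can be checked with simple arithmetic. The sketch below assumes that the average spacing between mismatch events is the midpoint of the stated 10 - 15 s range; the variable names are hypothetical and not taken from the authors' code.

```python
# Illustrative arithmetic only; variable names are hypothetical.
session_duration_s = 5 * 60        # closed loop session lasts 5 minutes
interval_range_s = (10.0, 15.0)    # approximate spacing between mismatch events

# Assumption: the mean spacing is the midpoint of the stated range.
mean_interval_s = sum(interval_range_s) / 2

expected_events = session_duration_s / mean_interval_s   # events at the mean spacing
max_events = session_duration_s / interval_range_s[0]    # if every gap were 10 s
min_events = session_duration_s / interval_range_s[1]    # if every gap were 15 s

print(expected_events, min_events, max_events)
```

      Under this assumption the session yields 24 events at the mean spacing, bounded by 20 and 30 events for the longest and shortest spacing.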

      Author response image 2.

      Reliable ERPs can be observed with as few as 16 trials across EEG channels. (A) Histograms showing the distribution of the number of valid mismatch trials per participant for each electrode pair (Fp1–2, C3–4, P3–4, O1–2). (B) Representative EEG responses to visuomotor mismatch events from a single participant, recorded at electrode pairs Fp1–2, C3–4, P3–4, and O1–2. Waveforms were computed using the indicated number of trials (shown above each trace). Dashed vertical red lines are onset and offset of the visuomotor mismatch.

      (c) it seems that the seemingly static "visual" condition resulted in a larger proportion of data rejected due to movement (or, as later mentioned, nausea) than the "visuomotor" condition, which is counterintuitive and needs further explanation;

      This is a misunderstanding. The ‘visual paradigm’ the reviewer is referring to comprises the experiments shown in Figure 1. Here we record visual responses in both sitting and walking participants. In this experiment, as in others, exclusion was primarily driven by the part of the paradigm where the subjects were moving. To make this clearer we have added Table S2 to the manuscript that provides an overview of trials excluded by paradigm and session.

      (d) authors mention eye movements as a potential issue, which should be possible to detect from frontal channels. Additionally, it's not entirely clear how many datasets were discarded (the results section mentions 19/48 in the visual condition, then 4+11 in the playback condition - isn't this the same condition?)

      The visual paradigm corresponds to the data shown in Figure 1, in which participants viewed a flipping checkerboard in both a walking and a stationary session. The open loop session is part of the visuomotor paradigm shown in Figure 2, where participants were exposed to a replay of the visual flow that had been self-generated during the preceding closed loop session, including the visual flow halts that constituted visuomotor mismatches in the closed loop session. Please note, to avoid such confusion, we have attempted to standardize the usage of paradigm (visual vs. visuomotor) and session (sitting vs. walking, and closed loop vs. open loop) throughout. In addition, we have added a table to summarize the number of excluded trials by paradigm and session as Table S2 to the manuscript.

      In comments 1 and 2 of the public review, the reviewer also points out that we did not control for eye movements and, we presume relatedly, claims that we did not use common EEG reliability standards. Regarding the first point, we performed additional experiments in an independent cohort of participants to test whether eye movements could account for the visuomotor mismatch responses. We recorded eye movements during closed loop sessions and found that changes in eye speed (Figure S7A) or blink rate (Figure S7B) following the mismatch stimulus had a longer latency than visuomotor mismatch responses in EEG. This suggests that the visuomotor mismatch response cannot be explained by eye blinks or changes in eye movement speed. Regarding the second point, we are not sure we understand. Trial exclusion based on a fixed voltage threshold of 100 µV is relatively common, and our rejection rates are on par with, and on occipital electrodes even lower than, those of other work on EEG recordings during locomotion or movement (see e.g. (Oliveira et al., 2016)).
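
      For concreteness, a fixed voltage threshold rejection can be sketched as follows. This is a minimal illustration and not the authors' pipeline: the function name is hypothetical, and it assumes the criterion is applied to the absolute amplitude of each trial (the response does not state whether absolute or peak-to-peak amplitude was used).

```python
import numpy as np

def reject_trials_by_amplitude(trials_uv, threshold_uv=100.0):
    """Drop trials whose absolute amplitude exceeds a fixed threshold.

    trials_uv: array-like of shape (n_trials, n_samples), in microvolts.
    Returns the kept trials and a boolean mask over the original trials.
    """
    trials_uv = np.asarray(trials_uv, dtype=float)
    # Largest absolute excursion within each trial (assumed criterion).
    peak_abs = np.max(np.abs(trials_uv), axis=1)
    keep_mask = peak_abs <= threshold_uv
    return trials_uv[keep_mask], keep_mask
```

      The returned mask makes it straightforward to report rejection rates per electrode pair and session.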

      Nevertheless, we did attempt to apply independent component analysis (ICA) based filtering to the EEG data (Delorme and Makeig, 2004). However, these methods were designed for high channel density recordings. With only 8 channels, ICA is unable to reliably isolate eye movement or motion artefact components of the EEG. To illustrate this, we tested two artifact-rejection strategies. In the first approach, components associated with non-neural artifacts (e.g., muscle activity, line noise, eye movements) were removed only if at least 90% of the component’s variance was assigned to a single artifact class (Author response image 3A). In the second, more permissive approach aimed specifically at reducing eye movement artifacts, components were removed if artifact-related activity exceeded 90% for non-eye artifacts, while the threshold for eye-related components was lowered to 60% (Author response image 3C). We lowered the threshold for excluding eye-related components to ensure that EEG signals influenced by eye movements were effectively removed. In both cases - whether the eye-component threshold was set to 90% or 60% - the averaged responses to visuomotor mismatch trials remained largely similar to the previously reported data, despite higher noise in some traces. Interestingly, when we then followed the ICA filtering by our voltage threshold based exclusion with a threshold of 100 µV, the resulting traces closely resembled the patterns described in the paper (Author response image 3B and 3D). Thus, we conclude that the non-ICA-filtered responses are easier to interpret, free of any potential ICA filtering artifacts, and far less dependent on parameter choices (of the ICA filtering).
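
      The two component selection rules described above can be expressed as one thresholding function. The sketch below is an assumption-laden illustration: it presumes that some classifier has already assigned each ICA component a score per class (the response does not name the classifier), and the function name, class labels, and inclusive comparison are all hypothetical choices.

```python
import numpy as np

def components_to_remove(class_scores, class_names,
                         artifact_threshold=0.90, eye_threshold=0.90,
                         eye_label="eye", brain_label="brain"):
    """Return indices of components flagged as artifacts.

    class_scores: array-like of shape (n_components, n_classes), each row
        holding the fraction assigned to every class for one component.
    class_names: class label for each column (hypothetical labels).
    """
    scores = np.asarray(class_scores, dtype=float)
    remove = np.zeros(scores.shape[0], dtype=bool)
    for col, name in enumerate(class_names):
        if name == brain_label:
            continue  # never remove a component for looking neural
        threshold = eye_threshold if name == eye_label else artifact_threshold
        remove |= scores[:, col] >= threshold
    return np.flatnonzero(remove)

names = ["brain", "muscle", "eye", "line_noise"]
scores = [[0.95, 0.02, 0.02, 0.01],
          [0.05, 0.92, 0.02, 0.01],
          [0.25, 0.05, 0.65, 0.05],
          [0.03, 0.02, 0.93, 0.02]]
strict = components_to_remove(scores, names)                         # 90% for all classes
permissive = components_to_remove(scores, names, eye_threshold=0.60) # 60% for eye class
```

      In this toy example the strict rule flags components 1 and 3, and the permissive rule additionally flags component 2, whose eye score lies between 60% and 90%.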

      Author response image 3.

      Removal of artifacts identified with ICA does not change the visuomotor mismatch responses. (A) Visuomotor mismatch responses recorded from occipital electrodes after artifact correction. Components associated with non-neural artifacts (e.g., muscle activity, line noise, eye movements) were removed only if ≥90% of the component’s variance was attributed to a single artifact class. Solid black line represents the mean, and shading indicates the SEM across participants. Dashed vertical red lines are onset and offset of the visuomotor mismatch. (B) As in A, but excluding trials with amplitudes exceeding 100 µV. (C) As in A, but components were removed if artifact-related activity exceeded 90% for non-ocular artifacts, while the threshold for eye-related components was lowered to 60%. (D) As in C, but excluding trials with amplitudes exceeding 100 µV.

      (2) The finding that mismatch responses are observed at all channels, with differences in amplitudes but not latencies, indicates that volume conduction may affect the results. I would strongly suggest accounting for this using a method appropriate for the very small number of channels, e.g., phase lag index.

      We are not sure we understand. The phase lag index is a method to estimate functional connectivity in a way that corrects for volume conduction (using phase lag). We make no claims about functional connectivity; thus, we are not sure what the reviewer is suggesting we do. The fact that the visual and visuomotor mismatch responses were measurable on all electrodes could indeed be in part explained by volume conduction, but we see no way to estimate the volume conduction contribution. From mouse calcium imaging data, we know that both visual and visuomotor mismatch responses spread across large parts of dorsal cortex (including frontal regions like the ACC).
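
      For reference, the phase lag index is commonly defined as the absolute value of the trial- or time-averaged sign of the phase difference between two signals. The sketch below assumes that instantaneous phases have already been extracted (for example from a band-passed analytic signal); the function name is hypothetical.

```python
import numpy as np

def phase_lag_index(phase_a, phase_b):
    """Phase lag index between two phase series (radians).

    Computes |mean(sign(sin(phase_a - phase_b)))|. Zero-lag coupling,
    as produced by volume conduction, contributes sign 0 and therefore
    does not raise the index.
    """
    delta = np.asarray(phase_a, dtype=float) - np.asarray(phase_b, dtype=float)
    return float(np.abs(np.mean(np.sign(np.sin(delta)))))
```

      Because the index describes the coupling between two channels, it addresses connectivity estimates rather than the amplitude of an evoked response on a single channel, which is consistent with the authors' point that it does not directly apply to their claims.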

      With the addition of new data, the latency difference between occipital and frontal electrodes - previously observed only as a trend - is now statistically significant (Figure 3E). Occipital responses emerge earlier than frontal responses, suggesting that mismatch-related activity likely originates in sensory visual regions and subsequently propagates to more frontal areas, similar to what has been reported in mouse cortex (Heindorf and Keller, 2024).

      (3) The authors compare different types of mismatch responses (including auditory oddballs) in the same set of (occipital) channels, but doesn't this undermine the spatial specificity of the results? Classical auditory mismatch negativity is typically observed over central channels, so weaker amplitudes of auditory mismatch responses in occipital channels are likely trivially explained by modality differences. As such, I'm not convinced that this comparison is informative even in a qualitative manner.

      To address this point, we conducted additional auditory oddball experiments with recordings over the auditory cortex (channels T3, T4, T5, and T6). Given our central reference, these channels should capture the strongest mismatch negativity. The amplitude of the visuomotor mismatch response exceeded that of mismatch negativity on all tested channels (new Figures S8 and S9).

      (4) On a similar note, is the polarity reversal found for visual vs. mismatch responses specific to occipital channels?

      Thank you for this interesting question. In fact, polarity reversal was consistently observed across all recorded channels; this has now been added as a main figure to the manuscript (Figure 5).

      (5) Figure S4C seems to cut off one outlier, and I don't see this outlier included in the boxplot.

      Correct, that is why we describe the boxplots in the figure legend as: “Boxes mark median, quartiles, and range of data not considered outliers.” The axes have now been adjusted to include all data points.

      Discussion:

      "A central tenet of the cortical circuit for predictive processing is the split into separate populations of neurons that compute positive and negative prediction errors (Keller and Mrsic-Flogel, 2018; Rao and Ballard, 1999)" - this may be the case for visuomotor mismatch signals or reward prediction errors, but signed PEs do not play a central role in other proposed microcircuits for predictive processing in the perceptual domain (e.g. Bastos)

      Signed prediction errors do not play a central role in proposed cortical microcircuits for predictive processing that do not burden themselves with making a concrete proposal for the implementation of the prediction error computation. The (Bastos et al., 2012) work is a good example of this. The equation for the error term provided in that paper is clearly signed (nothing stops the error from going negative), but no proposal is made for how layer 2/3 excitatory neurons are supposed to signal this quantity. With baseline activity levels close to zero in layer 2/3, there really is only one way to do this, and that is separate populations of negative and positive prediction error neurons. With non-zero baseline firing rate, one could do this bidirectionally around a mean firing rate (as is typically thought of dopaminergic RPE neurons). There are more abstract Bayesian implementations that assume logarithmic transformations that could also implement a prediction error-like system without negative firing rates. But given the absence of any physiological evidence, we will refrain from discussing these. However, most importantly, there is now considerable evidence for the existence of both negative and positive prediction error neurons in layer 2/3 of mouse visual cortex. Thus, by “cortical circuit for predictive processing” we here mean those that make biologically plausible proposals for prediction error computations. Also note, the (Rao and Ballard, 1999) model is probably the prime example for what the reviewer calls a proposed microcircuit for predictive processing in the “perceptual domain”.
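
      The argument that near-zero baseline firing forces a split into two populations can be illustrated with a rectification sketch. This is a toy illustration under stated assumptions, not a model from the manuscript: the function name is hypothetical, and the sign convention (error equals input minus prediction) is assumed.

```python
import numpy as np

def split_prediction_error(sensory_input, prediction):
    """Represent a signed error with two non-negative populations.

    Assumed convention: error = input - prediction. Firing rates cannot
    go below zero, so each sign of the error is carried separately.
    """
    error = np.asarray(sensory_input, dtype=float) - np.asarray(prediction, dtype=float)
    positive_pe = np.maximum(error, 0.0)    # more input than predicted
    negative_pe = np.maximum(-error, 0.0)   # less input than predicted
    return positive_pe, negative_pe
```

      The signed error is recovered as the difference between the two outputs, so no information is lost by the split; under this convention, a halt of visual flow during walking would drive the negative prediction error population.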

      Reviewer #3 (Public review):

      Summary:

      Solyga, Zelechowski, and Keller present a concise report of an innovative study demonstrating clear visuomotor mismatch responses in ambulating humans, using a mobile EEG setup and virtual reality. Human subjects walked around a virtual corridor while EEGs were recorded. Occasionally, motion and visual flow were uncoupled, and this evoked a mismatch response that was strongest in occipitally placed electrodes and had a considerable signal-to-noise ratio. It was robust across participants and could not be explained by the visual stimulus alone.

      Strengths:

      This is an important extension of their prior work in mice, and represents an elegant translation of those previous findings to humans, where future work can inform theories of e.g., psychiatric diseases that are believed to involve disordered predictive processing. For the most part, the authors are appropriately circumspect in their interpretations and discussions of the implications. I found the discussion of the polarity differences they found in light of separate positive and negative prediction errors, intriguing.

      Weaknesses:

      The primary weaknesses rest in how the results are sold and interpreted.

      Most notably, the interpretation of the results of the comparison of visuomotor mismatches to the passive auditory oddball induced mismatch responses is inappropriate, given suboptimal electrode choices, unclear matching of trial numbers, and other factors. To clarify, regarding the auditory oddball portion in Figure 5, the data quality is a concern for the auditory ERPs, and the choice of occipital electrodes is a likely culprit. Typically, auditory evoked responses are maximal at Cz or FCz, although these contacts don't seem to be available with this setup. In general, caution is warranted in comparing ERP peaks between two different sensory modalities - especially if attention is directed elsewhere (to a silent movie) during one recording and not during the other. The authors discuss this as a purely "qualitative" comparison in the text, which is appreciated, and do acknowledge the limitations within the results section, but the figure title and, importantly, the abstract set a different tone. At least, for comparisons between auditory mismatch and visuomotor mismatch, trial numbers need to be equated, as ERP magnitude can be augmented by noise (which reduces with increased numbers of trials in the average).

      To address this point, we conducted additional auditory oddball experiments with recordings over the auditory cortex (channels T3, T4, T5, and T6). Given our central reference, these channels should capture the strongest mismatch negativity. Nevertheless, the amplitude of the visuomotor mismatch response exceeded that of mismatch negativity on all tested channels (these results are now shown in the new Figures S8 and S9), and the response power was significantly greater for the visuomotor mismatch than for mismatch negativity. Independent of the electrode we test, the visuomotor mismatch response has a power 5 to 10 times higher than that of the MMN response. And the number of trials per participant that met quality criteria was comparable between the visuomotor mismatch paradigm (mean = 23 trials) and the auditory mismatch paradigm (mean = 28 trials) (Author response image 4).

      Author response image 4.

      Number of trials included for analysis is comparable between the visuomotor and oddball paradigms. (A) Histogram showing the distribution of the number of valid trials per participant for the O1-2 electrode pair in the visuomotor mismatch paradigm. (B) Same as in A but for deviant stimulus presentations in the oddball paradigm.

      And more generally, the size of the mismatch event at the scalp does not scale one-to-one with the size at the level of the neural tissue. One can imagine a number of variables that impact scalp level magnitudes, which are orthogonal to actual cortex-level activation: the size, spread, and polarity variance of the activated source (which all would diminish amplitude at the scalp due to polyphasic summation/cancelation); the variance of phase to a stimulus across trials (cross trial phase locking) vs magnitude of underlying power (the former, in theory, relates to bottom-up activity and the latter can reflect feedback, which has more variability in time across trials); and the distance of the scalp electrode from the activated tissue (which, for the auditory system, would be larger (FCz to superior temporal gyrus) than for the visual system (O1 to V1/2)). None of this precludes the inclusion of the auditory mismatch, which is a strength of the study, but interpretations about this supporting a supremacy of sensory-motor mismatch - regardless of validity - are not warranted. I would recommend changing the way this is presented in the abstract.

      We agree with the point that the EEG response does not need to reflect the total cortical activation. However, the discussion in the abstract (and elsewhere) is in the context of clinical experiments where the underlying cortical activity pattern is irrelevant if it does not trigger a clinically measurable (by EEG in this case) response. The abstract only makes a comparison to MMN implicitly in this sentence “Second, a paradigm that can trigger strong prediction error responses and consequently requires shorter recording times could simplify experiments in a clinical setting.” We are not sure how to phrase this even more carefully – the statement at face value is a truism. The reviewer, we assume, takes exception to the unstated implication that visuomotor prediction errors trigger stronger responses than MMN. Given the data we have, we assume most authors would not consider it an overstatement to make that claim outright.

      Otherwise, the data are of adequate quality to derive most of their conclusions.

      The authors claim that the mismatch responses emanate from within the occipital cortex, but I would require denser scalp coverage or a demonstration of consistent impedances across electrodes and across subjects to make conclusions about the underlying cortical sources (especially given the latencies of their peaks). In EEG, the distribution of voltage on the scalp is, of course, related to but not directly reflective of the distribution of the underlying sources. The authors are mostly careful in their discussion of this, but I would strongly recommend changing the word choice of "in occipital cortex" to "over occipital cortex" or even "posteriorly distributed". Even with very dense electrode coverage and co-registration to MRIs for the generation of forward models that constrain solutions, source localization of EEG signals is very challenging and not a simple problem. Given the convoluted and interior nature of human V1, the ability to reliably detect early evoked responses (which show the mismatch in mouse models) at the scalp in ERP peaks is challenging - especially if one is collapsing ERPs across subjects. And - given the latency of the mismatch responses, I'd imagine that many distributed cortical regions contribute to the responses seen at the scalp.

      This is an excellent point; we have rephrased throughout to “over occipital cortex” instead of “in occipital cortex”.

      I think that Figure 3C, but as a difference of visual mismatch vs halting flow alone (in the open loop) might be additionally informative, as it clarifies exactly where the pure "mismatch" or prediction error is represented.

      We performed the analysis as suggested (Author response image 5). Visuomotor mismatch responses are stronger on all electrodes compared to playback halt responses. This difference is also larger in data recorded on occipital electrodes.

      Author response image 5.

      Comparison of the difference between visuomotor mismatch and playback halt on all electrodes. Average response strength was calculated within a 100 ms window centered on the peak of the average visuomotor mismatch response across all electrodes. Boxes mark median, quartiles, and range of data not considered outliers. Each circle represents data from one participant. **: p<0.01, *: p<0.05, Fp1-2: 20 participants, C3-4: 31 participants, P3-4: 35 participants, O1-2: 32 participants.
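
      The response strength measure described in this legend can be sketched as a mean over a window centered on the peak of a reference waveform. This is a minimal illustration under assumptions the legend does not settle: the peak is taken as the largest absolute value of the reference average, and the function and argument names are hypothetical.

```python
import numpy as np

def windowed_response_strength(participant_erps, times_s, reference_erp, window_s=0.100):
    """Mean amplitude per participant in a window centered on a reference peak.

    participant_erps: array-like, shape (n_participants, n_samples).
    times_s: sample times in seconds, shape (n_samples,).
    reference_erp: average response used to locate the peak, shape (n_samples,).
    """
    times_s = np.asarray(times_s, dtype=float)
    reference_erp = np.asarray(reference_erp, dtype=float)
    # Assumption: the peak is the largest absolute deflection of the reference.
    peak_time = times_s[np.argmax(np.abs(reference_erp))]
    in_window = np.abs(times_s - peak_time) <= window_s / 2
    erps = np.asarray(participant_erps, dtype=float)
    return erps[:, in_window].mean(axis=1), peak_time
```

      Computing this once for mismatch and once for playback halt responses, with the same window, gives the per-participant values whose difference is compared across electrodes.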

      As a suggestion, the authors are encouraged to analyse time-frequency power and phase locking for these mismatch responses, as is common in much of the literature (see Roach et al 2008, Schizophrenia Bulletin). This is not to say that doing so will yield insights into oscillations per se, but converting the data to the time-frequency domain provides another perspective that has some advantages. It fosters translations to rodent models, as ERP peaks do not map well between species, but e.g., delta-theta power does (see Lee et al 2018, Neuropsychopharmacology; Javitt et al 2018, Schizophrenia research; Gallimore et al 2023, Cereb Ctx). Further, ERP peaks can be influenced by the actual neuroanatomy of an individual (especially for quantifying V1 responses). Time frequency analyses may aid in interpreting the "early negative deflection with a peak latency of 48 ms" finding as well.

      We have performed time–frequency power and phase-locking analyses for both visual responses (Author response image 6 and Author response image 7) and visuomotor mismatch and playback halt responses (Author response image 8 and Author response image 9), as suggested. We include the results of these analyses here rather than in the manuscript, as they are not fully developed yet. We may add these to a future publication, for which we would want to properly quantify the stability of these effects.

      In brief, time–frequency representations of power did identify potentially interesting differences between walking and sitting sessions in the visual paradigm. Inter-trial phase coherence (ITPC) revealed an early increase in alpha-band synchronization suggesting that phase alignment of alpha oscillations may contribute to the early differences in visual responses between walking and sitting. The same analyses were applied to visuomotor mismatch and playback halt responses. Time–frequency power analysis revealed an increase in delta-band power during visuomotor mismatch, consistent with previous reports linking delta activity to prediction error processing, including reward prediction errors (Cavanagh, 2015), unexpected final words (Webb and Sohoglu, 2025), and visual deviance detection (West et al., 2024). Notably, it appears as if the increase in delta power emerged first over occipital electrodes and appeared later over more frontal electrodes, forming a spatiotemporal gradient of onset across the scalp.

      Delta power changes were markedly reduced in the playback halt responses at the time of visual flow cessation. While some power changes were observed, they occurred primarily at visual flow onset rather than at flow offset. Inter-trial phase coherence analysis further revealed delta-band synchronization over occipital electrodes following visuomotor mismatch, whereas the playback halt response showed strong phase synchronization in both delta and theta bands following visual flow onset.
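
      Inter-trial phase coherence is conventionally computed as the length of the mean unit phase vector across trials at each time point. The sketch below assumes that per-trial phases at a given frequency have already been obtained (for example from a wavelet transform); the function name is hypothetical and this is not the authors' analysis code.

```python
import numpy as np

def inter_trial_phase_coherence(phases):
    """ITPC across trials.

    phases: array-like of shape (n_trials, n_times), in radians, for one
        frequency and one channel.
    Returns an array of shape (n_times,) with values between 0 (phases
    uniformly scattered across trials) and 1 (identical phase on every trial).
    """
    unit_vectors = np.exp(1j * np.asarray(phases, dtype=float))
    return np.abs(unit_vectors.mean(axis=0))
```

      Because each trial contributes a unit vector, the measure is insensitive to trial amplitude, which is why it complements the power analysis described above.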

      Author response image 6.

      Time–frequency representations of EEG power changes during the visual paradigm. (A) Time–frequency maps showing changes in spectral power relative to baseline for electrodes Fp1–2, C3–4, P3–4, and O1–2 following checkerboard reversal in the sitting session. The dashed red vertical line indicates the time of the checkerboard reversal (0 s). (B) As in A, but recorded while participants were walking.

      Author response image 7.

      Inter-trial phase coherence (ITPC) for visual trials during sitting and walking. (A) ITPC across trials for electrode pairs Fp1–2, C3–4, P3–4, and O1–2 following checkerboard reversal in the sitting session. The dashed red vertical line marks the time of the checkerboard reversal (0 s). (B) As in A, but recorded during walking.

      Author response image 8.

      Time–frequency representations of EEG power changes during visuomotor mismatch and playback halt responses. (A) Time–frequency maps showing changes in spectral power relative to baseline for electrodes Fp1–2, C3–4, P3–4, and O1–2 following visuomotor mismatch presentation. Dashed vertical red lines are onset and offset of the visuomotor mismatch. (B) As in A, but for playback halts.

      Author response image 9.

      Inter-trial phase coherence (ITPC) for the visuomotor mismatch and playback halt responses. (A) ITPC across trials for electrode pairs Fp1–2, C3–4, P3–4, and O1–2 following visuomotor mismatch presentation. Dashed vertical red lines are onset and offset of the visuomotor mismatch. (B) As in A, but for playback halts.

      Finally, the sentence in the abstract that this paradigm "can trigger strong prediction error responses and consequently requires shorter recording times would simplify experiments in a clinical setting" is a nice setup to the paper, but the very fact that one third of recordings had to be removed due to movement artifact, and that hairstyle modulates the recording SNR, suggests that this paradigm, using the reported equipment, may have limited clinical utility in its current form. Further, auditory oddball paradigms are of great clinical utility because they do not require explicit attention and can be recorded very quickly with no behavioral involvement of a hospitalized patient. This should be discussed, although it does not detract from the overall scientific importance of the study. The authors should reconsider putting this statement in the abstract.

      We have added a paragraph to the discussion to address these points. Note that we get robust and strong responses with very few trials (Author response image 2). The fact that we need to discard up to 21.7% of trials due to movement or eye-blink artefacts does little to change the fact that we need far fewer trials and obtain larger and more robust responses than other EEG paradigms. Finally, we understand that it is sometimes useful not to require participants to pay attention to the task. However, a paradigm that is engaging and fun for participants and takes 5 minutes of recording time is probably an advantage equally often.

      Reviewer #3 (Recommendations for the authors):

      Minor points:

      (1) In the Introduction, I'm not sure that the logic comes through as to what the authors aim to illustrate by comparing mice to humans, in terms of precision and "movement modulation". In some cases, the precision of the comparison is referred to, and in others, the precision of the prediction (I think?). I'm not sure if they mean for this to be different or not. Similarly, on line 81, "If indeed the precision of visuomotor coupling determines the amount of motor modulation of visual responses" - here I'm a little confused about "amount of motor modulation": to me, the term "modulation" refers to a conditional modifier (if moving, then suppress visual movement responses; if not moving, then amplify visual movement responses) rather than movement-driven activity. The way I'm reading it, the authors mean the latter, but I could be misunderstanding.

      We have rephrased this section of the introduction.

      (2) I think it could be helpful, in the sentence starting on line 65, to reiterate that this observation of higher-than-expected motor activity in V1 is in mice (if I'm understanding it correctly). I also found myself tangled up in the difference between motor-related activity in V1 and motor-modulation in V1 in this paragraph.

      We have rephrased this section of the introduction.

      (3) For signal power, was the amplitude squared on individual trials prior to averaging, or after averaging? If prior, it would help with separating amplitude modulations from phase variance.

      In our previous analysis, power was computed by squaring the amplitude after trial averaging (Author response image 10A). We repeated the analysis using the alternative approach in which power was calculated for individual trials and then averaged (Author response image 10B). Although this method yields substantially higher absolute power values, the overall pattern of results remains unchanged: visuomotor mismatch responses continue to show significantly higher power than visual responses. To examine phase variance, we additionally analyzed inter-trial phase coherence (Author response image 7 and Author response image 9).
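The difference between the two orders of operation can be made concrete with a small sketch. It assumes that power is the mean squared amplitude over the samples of the analysis window; that aggregation, the function names, and the toy trials are assumptions for illustration, not the authors' code.

```python
def power_after_averaging(trials):
    """Average across trials first, then square (power of the evoked response)."""
    n_trials = len(trials)
    n_samples = len(trials[0])
    evoked = [sum(trial[i] for trial in trials) / n_trials for i in range(n_samples)]
    return sum(v * v for v in evoked) / n_samples


def power_before_averaging(trials):
    """Square each trial first, then average across trials."""
    per_trial = [sum(v * v for v in trial) / len(trial) for trial in trials]
    return sum(per_trial) / len(per_trial)


# Two toy trials with equal amplitude but opposite sign (no phase consistency).
opposite = [[1.0, -1.0, 1.0, -1.0], [-1.0, 1.0, -1.0, 1.0]]
# Two identical toy trials (perfect phase consistency).
identical = [[1.0, -1.0, 1.0, -1.0], [1.0, -1.0, 1.0, -1.0]]

evoked_opposite = power_after_averaging(opposite)   # activity cancels in the average
total_opposite = power_before_averaging(opposite)   # amplitude survives
```

Squaring before averaging can never give a smaller value than squaring after averaging, which is consistent with the higher absolute power values reported for the per-trial method.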

      Author response image 10.

      Visuomotor mismatch responses have more power compared to visual responses. (A) Comparison of power between visuomotor mismatch and visual responses, calculated within a 0 - 0.5 s time window following stimulus onset. Power was computed by squaring the amplitude after trial averaging. Boxes indicate the median and interquartile range, with whiskers showing the range excluding outliers; circles represent data from individual participants. ***p < 0.001. (B) Same comparison as in (A), but with power calculated by squaring the amplitude of individual trials prior to averaging.

      (4) The "the world suddenly flew forward!" response from the participant, I understand, and I believe that it is useful to illustrate a point. I do not understand the "Are you printing this? - Hi Mom! " part of the participant response, and I'm not sure it adds to the paper, beyond amusement, which seems inappropriate.

      One of the authors (the one who did none of the experiments) finds this endlessly hilarious and as the reviewer notes, it might add amusement more generally. “Inappropriate” might be a bit harsh – according to our favorite AI chatbot: “Amusement provides significant mental, physical, and social value by offering a necessary escape from routine, reducing stress, and fostering a connection. It enhances well-being through endorphin-releasing experiences and encourages social bonding, learning, and joy.” Nevertheless, we have censored the offending passage.

      Aizenbud, I., Audette, N., Auksztulewicz, R., Basiński, K., Bastos, A.M., Berry, M., Canales-Johnson, A., Choi, H., Clopath, C., Cohen, U., Costa, R.P., Filippo, R.D., Doronin, R., Errington, S.P., Gavornik, J.P., Gillon, C.J., Granier, A., Hamm, J.P., Hertäg, L., Kennedy, H., Kumar, S., Ladd, A., Ladret, H., Lecoq, J.A., Maier, A., McCarthy, P., Mei, J., Mejias, J., Mikulasch, F., Mudrik, N., Najafi, F., Nejad, K., Nejat, H., Oweiss, K., Petrovici, M.A., Priesemann, V., Rudelt, L., Ruediger, S., Russo, S., Salatiello, A., Senn, W., Sennesh, E., Sima, S., Uran, C., Vasilevskaya, A., Vezoli, J., Vinck, M., Westerberg, J.A., Wilmes, K., Xiong, Y.S., 2025. Neural mechanisms of predictive processing: a collaborative community experiment through the OpenScope program. https://doi.org/10.48550/arXiv.2504.09614

      Bastos, A.M., Usrey, W.M., Adams, R.A., Mangun, G.R., Fries, P., Friston, K.J., 2012. Canonical microcircuits for predictive coding. Neuron 76, 695–711. https://doi.org/10.1016/j.neuron.2012.10.038

      Cavanagh, J.F., 2015. Cortical delta activity reflects reward prediction error and related behavioral adjustments, but at different times. NeuroImage 110, 205–216. https://doi.org/10.1016/j.neuroimage.2015.02.007

      Delorme, A., Makeig, S., 2004. EEGLAB: an open source toolbox for analysis of single-trial EEG dynamics including independent component analysis. J. Neurosci. Methods 134, 9–21. https://doi.org/10.1016/j.jneumeth.2003.10.009

      Gramann, K., Gwin, J.T., Bigdely-Shamlo, N., Ferris, D.P., Makeig, S., 2010. Visual evoked responses during standing and walking. Front. Hum. Neurosci. 4, 202. https://doi.org/10.3389/fnhum.2010.00202

      Heindorf, M., Keller, G.B., 2024. Antipsychotic drugs selectively decorrelate long-range interactions in deep cortical layers. eLife 12, RP86805. https://doi.org/10.7554/eLife.86805

      Keller, G.B., Hahnloser, R.H.R., 2009. Neural processing of auditory feedback during vocal practice in a songbird. Nature 457, 187–90. https://doi.org/10.1038/nature07467

      Keller, G.B., Mrsic-Flogel, T.D., 2018. Predictive Processing: A Canonical Cortical Computation. Neuron 100, 424–435. https://doi.org/10.1016/j.neuron.2018.10.003

      Oliveira, A.S., Schlink, B.R., Hairston, W.D., König, P., Ferris, D.P., 2016. Proposing Metrics for Benchmarking Novel EEG Technologies Towards Real-World Measurements. Front. Hum. Neurosci. 10, 188. https://doi.org/10.3389/fnhum.2016.00188

      O’Toole, S.M., Oyibo, H.K., Keller, G.B., 2023. Molecularly targetable cell types in mouse visual cortex have distinguishable prediction error responses. Neuron 111, 2918-2928.e8. https://doi.org/10.1016/j.neuron.2023.08.015

      Rao, R.P.N., Ballard, D.H., 1999. Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects. Nat. Neurosci. 2, 79–87. https://doi.org/10.1038/4580

      Vasilevskaya, A., Widmer, F.C., Keller, G.B., Jordan, R., 2023. Locomotion-induced gain of visual responses cannot explain visuomotor mismatch responses in layer 2/3 of primary visual cortex. Cell Rep. 42, 112096. https://doi.org/10.1016/j.celrep.2023.112096

      Webb, J.M., Sohoglu, E., 2025. Cortical tracking of prediction error during perception of connected speech. https://doi.org/10.1101/2025.07.18.665498

      West, C.L., Bastos, G., Duran, A., Nadeem, S., Ricci, D., Groves, A.M.R., Wargo, J.A., Peterka, D.S., Leeuwen, N.V., Hamm, J.P., 2024. A lasting impact of serotonergic psychedelics on visual processing and behavior. https://doi.org/10.1101/2024.07.03.601959

    1. eLife Assessment

      This important study convincingly shows that Vibrio bacteria act as predators of ecologically significant algae that contribute to harmful blooms in the lab and in their natural habitat, and that predation is induced by starvation. The authors suggest a working model that can be the basis for future work on this system. The study will be very impactful to those interested in the diversity of microbial predator-prey interactions and controlling toxic algal bloom.

    2. Reviewer #1 (Public review):

      [Editors' note: this version has been assessed by the Reviewing Editor without further input from the original reviewers. We appreciate the revisions and the authors addressed all of the remaining minor concerns listed by the reviewers. We have no further suggestions for revision.]

      Summary:

      Rolland and colleagues investigated the interaction between Vibrio bacteria and Alexandrium algae. The authors found a correlation between the abundance of the two in the Thau Lagoon and observed in the laboratory that Vibrio grows to higher numbers in the presence of the algae than in monoculture. Timelapse imaging of Alexandrium in coculture with Vibrio enabled the authors to observe Vibrio bacteria in proximity to the algae and subsequent algae death. The authors further determine the mechanism of the interaction between the two and point out similarities between the observed phenotypes and predator prey behaviours across organisms.

      Strengths:

      The study combines field work with mechanistic studies in the laboratory and uses a wide array of techniques ranging from co-cultivation experiments to genetic engineering, microscopy and proteomics. Further, the authors test multiple Vibrio and Alexandrium species and claim a wide spread of the observed phenotypes.

      Comments on revisions:

      I thank the authors for their additional work on the manuscript. My comments were addressed to my satisfaction.

    3. Reviewer #2 (Public review):

      Goal summary:

      The authors sought to (i) demonstrate correlations between the dynamics of the dinoflagellate Alexandrium pacificum and the bacterium Vibrio atlanticus in natural populations, (ii) demonstrate the occurrence of predation in laboratory experiments, (iii) demonstrate that predation is induced by predator starvation, and (iv) test for effects of quorum sensing and iron-uptake genes on the predation process.

      Strengths include:

      - Data indicating correlated dynamics in a natural environment that increase the motivation for study of in vitro interactions

      - Experimental design allowing clear inference of predation based on population counts of both prey and predators in addition to microscopy-based evidence

      - Supplementation of population-level data with molecular approaches to test hypotheses regarding possible involvement of quorum sensing and iron uptake in predation

      Weaknesses include:

      - A quantitative analysis of effects of manipulating V. atlanticus density on rates of predation would have been valuable

      Appraisal:

      The authors convincingly demonstrate that V. atlanticus can prey on A. pacificum, provide strongly suggestive evidence that such predation is induced by starvation and clearly demonstrate that both iron availability and correspondingly the presence of genes involved in iron uptake strongly influence the efficacy of predation.

      Discussion of impact:

      This paper will interest those interested in the diversity of forms of microbial predation and how microbial predatory behavior responds to environmental fluctuations. It will also interest those investigating bacteria-algae interactions and potential ecological controls of algal blooms. It may also interest researchers of microbial cooperation in light of the suggestion of communication between predator cells.

    4. Author response:

      The following is the authors’ response to the previous reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      Rolland and colleagues investigated the interaction between Vibrio bacteria and Alexandrium algae. The authors found a correlation between the abundance of the two in the Thau Lagoon and observed in the laboratory that Vibrio grows to higher numbers in the presence of the algae than in monoculture. Timelapse imaging of Alexandrium in coculture with Vibrio enabled the authors to observe Vibrio bacteria in proximity to the algae and subsequent algae death. The authors further determine the mechanism of the interaction between the two and point out similarities between the observed phenotypes and predator prey behaviours across organisms.

      Strengths:

      The study combines field work with mechanistic studies in the laboratory and uses a wide array of techniques ranging from co-cultivation experiments to genetic engineering, microscopy and proteomics. Further, the authors test multiple Vibrio and Alexandrium species and claim a wide spread of the observed phenotypes.

      Comments on revisions:

      I thank the authors for their additional work on the manuscript. My comments were addressed to my satisfaction.

      Dear Reviewer #1, we thank you for your careful evaluation of our manuscript and for the time and effort you dedicated to this review. We are pleased that the revised version has addressed your concerns to your satisfaction.

      Reviewer #2 (Public review):

      Goal summary

      The authors sought to (i) demonstrate correlations between the dynamics of the dinoflagellate Alexandrium pacificum and the bacterium Vibrio atlanticus in natural populations, (ii) demonstrate the occurrence of predation in laboratory experiments, (iii) demonstrate that predation is induced by predator starvation, and (iv) test for effects of quorum sensing and iron-uptake genes on the predation process.

      Strengths include

      - Data indicating correlated dynamics in a natural environment that increase the motivation for study of in vitro interactions

      - Experimental design allowing clear inference of predation based on population counts of both prey and predators in addition to microscopy-based evidence

      - Supplementation of population-level data with molecular approaches to test hypotheses regarding possible involvement of quorum sensing and iron uptake in predation

      Weaknesses include

      - A quantitative analysis of effects of manipulating V. atlanticus density on rates of predation would have been valuable

      - Lack of clarity in some of the methodological descriptions

      Appraisal

      The authors convincingly demonstrate that V. atlanticus can prey on A. pacificum, provide strongly suggestive evidence that such predation is induced by starvation and clearly demonstrate that both iron availability and correspondingly the presence of genes involved in iron uptake strongly influence the efficacy of predation.

      Discussion of impact

      This paper will interest those interested in the diversity of forms of microbial predation and how microbial predatory behavior responds to environmental fluctuations. It will also interest those investigating bacteria-algae interactions and potential ecological controls of algal blooms. It may also interest researchers of microbial cooperation in light of the suggestion of communication between predator cells.

      Dear Reviewer #2, we sincerely thank you for the time you devoted to this second review of our manuscript. We greatly appreciate your thoughtful comments, which helped us further improve the clarity and precision of the manuscript. All your additional recommendations have been carefully considered and addressed in the revised version and in our responses below.

      Recommendations for the authors:

      Reviewer #2 (Recommendations for the authors):

      (2) The authors' reference to Fig. 4a did not address our concern about density potentially affecting the outcomes shown in Fig. 3. Fig. 4a does not provide any quantitative effects of manipulating Vibrio density. But the new density numbers the authors added in response to point (33) do seem to address our concern, because Vibrio densities become lower in the older cultures, excluding the possibility that the increased predation in older cultures might have been due to higher Vibrio densities. We think this should be stated explicitly.

      (33) See point (2) above. We think the authors should explicitly state in the text that the increased predation in older cultures was not due to higher Vibrio densities in those older cultures, referring to their data.

      As recommended by Reviewer#2, we added the sentence “Importantly, Vibrio densities decreased with culture age, ruling out the possibility that the stronger predation observed in older cultures was driven by higher bacterial densities” in the results section “Attack of A. pacificum ACT03 is activated by V. atlanticus LGP32 starvation.”

      (45) Is it known that bacterial predators collectively feed more on other bacteria than on microbial eukaryotes in natural habitats? While this certainly seems most likely, it's stated as fact, so the statement should either be supported with relevant citations or phrased as a likely hypothesis.

      As suggested, we rephrased this sentence “Predatory bacteria are found in a wide variety of environments and are commonly described as feeding on other bacteria, although some cases of predation on microbial eukaryotes have also been hypothesized” in the discussion section.

      (46) Perhaps "Conceiving predators as free-living organisms that kill other organisms and feed on them, this study suggests that Vibrios engage in a novel form of predation in which they kill and feed on algae."

      The reference to 'developing' a predator behavior is not clear. What is meant by 'develop'? It seems unnecessary.

      The use of italics when writing Vibrio is inconsistent.

      We agree that the reference to “developing” a predatory behavior was unclear and unnecessary. We therefore revised the sentence as follows: “Conceiving predators as free-living organisms that kill other organisms and feed on them, this study suggests that Vibrio engages in a novel form of predation in which it kills and feeds on algae.” We also corrected the inconsistent use of italics for Vibrio throughout the manuscript.

      (48) The authors might wish to revise this sentence: although M. xanthus does have a contact-dependent killing mechanism, it is our understanding that both Lysobacter and myxobacteria can kill some prey at a distance with diffusible secretions.

      The sentence “These bacteria must be in close proximity to their prey in order to cause lysis and utilize their biomass, regardless of the prey's species” was replaced by “These bacteria may require close proximity to their prey to cause lysis and utilize their biomass, although some can also kill prey at a distance through diffusible secretions”.

      (50) Why not directly say 'predatory behavior'?

      We totally agree and have reworded the sentence.

      Line by line feedback:

      28 '...the phycosphere, an interface ...'

      We agree and have revised the wording.

      24 'In the attack stage, Vibrios...'

      This sentence has been rephrased as recommended.

      35 surrounds -> surround

      The correction has been done.

      36 The lysis is induced by the cells not by the 'stage'. We would rephrase to 'in which the lysis and consumption of the dinoflagellates occurs'

      This sentence has been rephrased as recommended.

      41 'a new mechanism that could to be involved' -> 'a new mechanism that could be involved ...'

      The correction has been done.

      61 forms

      The correction has been done.

      98 'the role...in'

      The suggested correction has been performed.

      103 'Qpcr' -> 'qPCR'

      Thank you for spotting this typo. “Qpcr” was corrected to “qPCR” in the manuscript.

      125 Misplaced punctuation

      The punctuation was corrected.

      152 The use of '.' vs 'x' to indicate multiplication when writing numbers is inconsistent. In some cases both are missing.

      Numbers have been corrected throughout the manuscript.

      231 I would rephrase 'poor nutrient stress' to 'little nutrient stress' or 'no nutrient stress'

      The rephrasing was carried out as suggested.

      310 R and used packages are not cited

      We added the citation (R Core Team, 2024). Linear models, QQ plots (which are part of linear models), tests, and AICs are included in R by default and are credited to the R Core Team.

      The sentence “Statistical analyses were performed using R 3.6.3 software” was replaced by “Statistical analyses were performed using R 3.6.3 software (R Core Team, 2024) using Rstudio”.

      358 'are capable of simultaneously attacking'

      The expression “are capable of simultaneously attacking” was revised in the manuscript to improve clarity and readability.

      366 'exponential growth phase'

      We have corrected the wording to “exponential growth phase” in the revised manuscript.

      430 The large difference in incubation time between the sea-water vs nutrient-rich treatments and use of different media are unfortunate. These additional variables compromise the ability to directly ascribe observed differences to starvation.

      We agree, the sentence “The comparative analysis of the proteome of V. atlanticus LGP32 incubated 60 h in artificial seawater (ENSW) versus V. atlanticus LGP32 grown 12 h in Zobell nutrient-rich medium revealed 10 proteins modulated by nutrient stress (Fig. S2)” was replaced by “The comparative analysis of the proteome of V. atlanticus LGP32 incubated 60 h in artificial seawater (ENSW) versus V. atlanticus LGP32 grown 12 h in Zobell nutrient-rich medium revealed 10 proteins that were differentially abundant under these two contrasting conditions (Fig. S2)”

      443 Somewhat unclear sentence. I would rephrase this to "Remarkably, of the 10 proteins identified by proteomic analysis and eliminated by mutation, only elimination of PvuB prevented V. atlanticus from attacking A. pacificum ACT03."

      To clarify this point, the sentence “Remarkably, among the 10 proteins identified by proteomic analysis only V. atlanticus LGP32 mutant lacking pvuB failed to attack A. pacificum ACT03 (Fig. 4C; ANOVA p <0.001)” was replaced by “Remarkably, of the 10 proteins identified by proteomic analysis and eliminated by mutation, only elimination of PvuB prevented V. atlanticus from attacking A. pacificum ACT03 (Fig. 4C; ANOVA p <0.001).”

      445 'attack simultaneously' -> 'simultaneously attack'

      The suggested modification has been done.

      450 H3BO4 is written as Boron later, it would be good to call it boron here as well so that it is easier to make the connection for the reader.

      We agree, we modified the manuscript and called it boron.

      459 'no linked' -> 'no link'

      The text was modified accordingly.

      483 'which induces' -> 'which induce'

      The correction has been made.

      519 The use of Vibrio atlanticus and V. atlanticus is inconsistent within the text.

      We have checked and modified the manuscript in accordance with the recommendations.

      807-808 The use of the phrase 'Akaike information criterion (AICc) models' is confusing. Aren't these models just generalized linear models? It should be rephrased to make clear that the AICc is just a test that is used to select which model to use.

      We clarified this point by revising the Figure 1 legend. The sentences “(C) Result of Akaike information criterion (AICc) models tested to explain the mean value of degraded Alexandrium cells (dead cells) in spring. (D) Wald test of the AICc model attributing the mean value of degraded cells of Alexandrium in spring to free Vibrio” were replaced by “(C) Results of the Akaike Information Criterion (AICc) test conducted to select a model for explaining the mean value of dead Alexandrium (degraded cells) in spring. (D) Wald test of the AICc model explaining the mean value of dead Alexandrium in spring by free Vibrio”.
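As a reminder of what the criterion computes, the small-sample corrected AIC is AICc = 2k - 2 ln(L) + 2k(k + 1)/(n - k - 1), where k is the number of estimated parameters, n the number of observations, and L the maximized likelihood. The sketch below applies that standard formula to two hypothetical candidate models; the model names, log-likelihoods, and sample size are invented for illustration and are not the values behind Figure 1.

```python
def aicc(log_likelihood, k, n):
    """Small-sample corrected Akaike information criterion (lower is better)."""
    if n - k - 1 <= 0:
        raise ValueError("AICc requires n > k + 1")
    aic = 2 * k - 2 * log_likelihood
    return aic + (2 * k * (k + 1)) / (n - k - 1)


# Hypothetical candidates: name -> (maximized log-likelihood, parameter count).
candidates = {
    "free_vibrio": (-40.0, 2),
    "free_vibrio_plus_temperature": (-39.5, 3),
}
n_observations = 30  # made-up sample size

scores = {name: aicc(ll, k, n_observations) for name, (ll, k) in candidates.items()}
selected = min(scores, key=scores.get)  # model with the lowest AICc
```

Here the extra parameter improves the fit only slightly, so the penalty term makes the simpler model the one selected. This matches the reviewer's point that AICc ranks candidate models and is not itself a model.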

      827 The chronological sequence of snapshots is not very clear. Perhaps it would be clearer if pictures over a shorter timeframe were used to clearly show the gathering of the V. atlanticus cells near the algal cells.

      To address this point, we removed the first and the last 14 seconds of the snapshots to clearly show the gathering of the V. atlanticus cells near the algal cells, and we added an arrow on Fig. 2D to indicate the chronological order.

    1. eLife Assessment

      This important study describes a novel Bayesian psychophysical approach that efficiently measures how well humans can discriminate between colors across the entire isoluminant plane. The evidence was considered compelling, as it included successful model validation against hold-out data and published datasets. This approach could prove to be of use to color vision scientists, as well as to those who employ computational psychophysics and attempt to model perceptual stimulus fields with smooth variations over coordinate spaces.

    2. Reviewer #1 (Public review):

      Summary:

      This paper presents an ambitious and technically impressive attempt to map how well humans can discriminate between colours across the entire isoluminant plane. The authors introduce a novel Wishart Process Psychophysical Model (WPPM) - a Bayesian method that estimates how visual noise varies across colour space. Using an adaptive sampling procedure, they then obtain a dense set of discrimination thresholds from relatively few trials, producing a smooth, continuous map of perceptual sensitivity. They validate their procedure by comparing actual and predicted thresholds at an independent set of sample points. The work is a valuable contribution to computational psychophysics and offers a promising framework for modelling other perceptual stimulus fields more generally.

      Strengths:

      The approach is elegant and well-described, and the data are of high quality. The writing throughout is clear and the figures are clean (elegant in fact) and do a good job of explaining how the analysis was performed. The whole paper is tremendously thorough and the technical appendices and attention to detail are impressive (for example, a huge amount of data about calibration, variability of the stim system over time etc). This should be a touchstone for other papers that use calibrated colour stimuli.

      Comments on revised version:

      The authors have addressed all the issues I raised to my satisfaction.

    3. Reviewer #3 (Public review):

      Summary:

      This study presents a powerful and rigorous approach for characterizing stimulus discriminability throughout a sensory manifold, and is applied to the specific context of predicting color discrimination thresholds across the chromatic plane.

      Strengths:

      Color discrimination has played a fundamental role in studies of human color vision and for color applications, but as the authors note, remains poorly characterized. The study leverages the assumption that thresholds should vary smoothly and systematically within the space, and validates this with their own tests and comparisons with previous studies.

      Comments on revised version:

      My comments have been addressed.

    4. Author response:

      The following is the authors’ response to the original reviews.

      We would like to thank the editors and the reviewers for the thorough and insightful comments and suggestions. Addressing them has strengthened our manuscript. We have carefully addressed all reviewer comments, as described in detail below, as well as additional comments we received from others. In addition, we made two substantive updates to the manuscript:

      (1) We improved the estimation of uncertainty in the model predictions by computing 95% confidence intervals using 120 bootstrapped datasets (instead of the full range, i.e. 100%, of 10 bootstrapped datasets in the original submission) to match the number of bootstraps used for the validation dataset.

      (2) We selected a slightly different hyperparameter value based on follow-up analyses suggested by Reviewer 1, which provided very useful information.

      Importantly, none of these changes alter the main results or conclusions of the paper.
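To illustrate the idea behind a 95% bootstrap interval, the following is a generic percentile-bootstrap sketch. It resamples a plain list of numbers and recomputes a mean, whereas the analysis described above refits the model to each bootstrapped dataset; the function name, the index rule used for the percentiles, and the threshold values are all assumptions made for illustration.

```python
import math
import random
import statistics


def percentile_bootstrap_ci(values, statistic=statistics.mean,
                            n_boot=120, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for `statistic`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    # Resample with replacement n_boot times and recompute the statistic.
    estimates = sorted(
        statistic([values[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    tail = (1.0 - level) / 2.0
    lower = estimates[math.floor(tail * (n_boot - 1))]
    upper = estimates[math.ceil((1.0 - tail) * (n_boot - 1))]
    return lower, upper


# Made-up threshold estimates, for illustration only.
thresholds = [0.021, 0.025, 0.019, 0.030, 0.024, 0.027, 0.022, 0.026]
ci_low, ci_high = percentile_bootstrap_ci(thresholds)
```

With only 10 resamples a central 95% interval cannot be distinguished from the full range of the estimates, which is one reason a larger number of bootstraps gives a more meaningful interval.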

      Beyond these changes and those outlined below, we also worked to improve the clarity of the prose throughout as well as added various additional citations to the literature.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      This paper presents an ambitious and technically impressive attempt to map how well humans can discriminate between colours across the entire isoluminant plane. The authors introduce a novel Wishart Process Psychophysical Model (WPPM) - a Bayesian method that estimates how visual noise varies across colour space. Using an adaptive sampling procedure, they then obtain a dense set of discrimination thresholds from relatively few trials, producing a smooth, continuous map of perceptual sensitivity. They validate their procedure by comparing actual and predicted thresholds at an independent set of sample points. The work is a valuable contribution to computational psychophysics and offers a promising framework for modelling other perceptual stimulus fields more generally.

      Strengths:

      The approach is elegant and well-described (I learned a lot!), and the data are of high quality. The writing throughout is clear, and the figures are clean (elegant in fact) and do a good job of explaining how the analysis was performed. The whole paper is tremendously thorough, and the technical appendices and attention to detail are impressive (for example, a huge amount of data about calibration, variability of the stim system over time, etc). This should be a touchstone for other papers that use calibrated colour stimuli.

      Weaknesses:

      Overall, the paper works as a general validation of the WPPM approach. Importantly, the authors validate the model for the particular stimuli that they use by testing model predictions against novel sample locations that were not part of the fitting procedure (Figure 2). The agreement is pretty good, and there is no overall bias (perhaps local bias?), but they do note a statistically-significant deviation in the shape of the threshold ellipses. The data also deviate significantly from historical measurements, and I think the paper would be considerably stronger with additional analyses to test the generality of its conclusions and to make clearer how they connect with classical colour vision research. In particular, three points could use some extra work:

      (1) Smoothness prior.

      The WPPM assumes that perceptual noise changes smoothly across colour space, but the degree of smoothness (the eta parameter) must affect the results. I did not see an analysis of its effects - it seems to be fixed at 0.5 (line 650). The authors claim that because the confidence intervals of the MOCS and the model thresholds overlap (line 223), the smoothing is not a problem, but this might just be because the thresholds are noisy. A systematic analysis varying this parameter (or at least testing a few other values), and reporting both predictive accuracy and anisotropy magnitude, would clarify whether the model's smoothness assumption is permitting or suppressing genuine structure in the data. Is the gamma parameter also similarly important? In particular, does changing the underlying smoothness constraint alter the systematic deviation between the model and the MOCS thresholds? The authors have thought about this (of course! - line 224), but also note a discrepancy (line 238). I also wonder if it would be possible to do some analysis on the posterior, which might also show if there are some regions of color space where this matters more than others? The reason for doing this is, in part, motivated by the third point below - it's not clear how well the fits here agree with historical data.

      Thank you for raising this important point. We have now added analyses of the effects of the two smoothness-related hyperparameters, ε and γ (see Appendix 10).

      First, we swept a range of values for each hyperparameter (ε: 0.1 – 1; γ: 0.000001 – 0.003) and evaluated model performance using 5-fold cross-validation of the dataset used to fit the WPPM, quantifying predictive accuracy on held-out test data. We used the mean negative log likelihood averaged across the held-out data in the cross validation as our measure of predictive accuracy (Figs. S27-31).
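
      The sweep described above can be sketched in a few lines. This is a toy illustration only, not the WPPM fitting code: the kernel-smoother model, the `bandwidth` hyperparameter (a stand-in for a smoothness hyperparameter such as ε or γ), and the synthetic data are all assumptions introduced here. What carries over is the procedure: k-fold cross-validation scored by the mean negative log likelihood of held-out binary responses.

```python
import numpy as np

def bernoulli_nll(p, y):
    """Mean negative log likelihood of binary responses y given predicted probabilities p."""
    p = np.clip(p, 1e-6, 1 - 1e-6)  # avoid log(0)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def kernel_predict(x_train, y_train, x_test, bandwidth):
    """Hypothetical smooth model: Gaussian-kernel average of training responses.
    A larger bandwidth gives a smoother fit."""
    d = x_test[:, None] - x_train[None, :]
    w = np.exp(-0.5 * (d / bandwidth) ** 2)
    return (w @ y_train) / (w.sum(axis=1) + 1e-12)

def cv_nll(x, y, bandwidth, k=5, seed=0):
    """k-fold cross-validated mean held-out NLL for one hyperparameter value."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), k)
    scores = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        p = kernel_predict(x[train], y[train], x[test], bandwidth)
        scores.append(bernoulli_nll(p, y[test]))
    return float(np.mean(scores))

# Synthetic binary-response data (assumption, for illustration only).
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 500)
p_true = 0.5 + 0.4 * np.sin(3 * x)
y = (rng.uniform(size=500) < p_true).astype(float)

# Sweep the hyperparameter and keep the value with the lowest held-out NLL.
bandwidths = [0.01, 0.1, 0.3, 1.0, 3.0]
scores = {b: cv_nll(x, y, b) for b in bandwidths}
best = min(scores, key=scores.get)
```

      Reusing the same fold assignment for every hyperparameter value (fixed `seed`) keeps the comparison across values paired, which reduces noise in the sweep.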

      The two hyperparameters affect cross-validation accuracy in a similar manner. With γ fixed at 0.0003, predictive accuracy is highest for ε in the range of approximately 0.3–0.5 and drops quite rapidly for ε < 0.3. We attribute this drop to oversmoothing. Cross-validation accuracy also decreases, albeit more gradually, for ε > 0.5. We attribute this to increased variance due to undersmoothing relative to the power of our datasets. Similarly, with ε fixed at 0.4, predictive accuracy is highest for γ values between approximately 0.0001 and 0.001, declines rapidly for smaller γ (oversmoothing), and more slowly for larger γ (undersmoothing).

      Second, we examined how the hyperparameter ε affected the agreement between the WPPM fit and the MOCS validation data. Specifically, at each ε, for each participant, we computed the linear regression between WPPM thresholds and validation thresholds at 25 reference locations. Then, we examined the slope and correlation coefficient of all participants as a function of ε. We found a classic bias–variance tradeoff. Excessive smoothness introduces bias by failing to capture structure in the data, whereas insufficient smoothness increases variance in model predictions. These results further support a choice of ε = 0.4 as lying near the optimal balance between bias and variance (Fig. S32).
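
      As a minimal sketch of the per-participant summary described above, the slope and correlation coefficient can be computed as follows. The function name, the direction of the regression (validation thresholds regressed on model thresholds), and the synthetic numbers are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def slope_and_r(model_thresholds, validation_thresholds):
    """Least-squares line through (model, validation) pairs plus Pearson r."""
    x = np.asarray(model_thresholds, dtype=float)
    y = np.asarray(validation_thresholds, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)  # degree-1 fit returns [slope, intercept]
    r = np.corrcoef(x, y)[0, 1]
    return float(slope), float(intercept), float(r)

# Synthetic thresholds at 25 reference locations (hypothetical values).
rng = np.random.default_rng(0)
model = rng.uniform(0.01, 0.05, 25)
validation = model + rng.normal(0.0, 0.002, 25)
slope, intercept, r = slope_and_r(model, validation)
```

      Under this convention, a slope near 1 with a high correlation indicates little systematic bias, while a lower correlation at an unchanged slope points to increased variance in the model predictions.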

      Based on these analyses, we selected for the final analysis ε = 0.4, slightly smaller than the preregistered value used in the original submission (0.5), while retaining the original value of γ (0.0003).

      We now discuss the reasons for this change in the revision and provide a more general discussion of the importance and practicalities of hyperparameter choice in Bayesian approaches to data analysis (Discussion / Prior specification).

      (2) Comparison with simpler models. It would help to see whether the full WPPM is genuinely required. Clearly, the data (both here and from historical papers) require some sort of anisotropy in the fitting - the sensitivities decrease as the stimuli move away from the adaptation point. But it's >not< clear how much the fits benefit from the full parameterisation used here. Perhaps fits for a small hierarchy of simpler models - starting with isotropic Gaussian noise (as a sort of 'null baseline') and progressing to a few low-dimensional variants - would reveal how much predictive power is gained by adding spatially varying anisotropy. This would demonstrate that the model's complexity is justified by the data.

      In the 5-fold cross-validation analysis described above (and now presented in Appendix 10), we found that when ε or γ is small, the stronger smoothness constraint leads to threshold ellipses that are nearly identical to each other across color space. Under these conditions, model predictions show poor accuracy on held-out test data and lead to poor predictions of the validation data. This observation addresses the underlying point raised by the reviewer, albeit in a different way than suggested: it shows that a degree of spatially varying anisotropy is necessary to capture the structure of the data. We now make this point in the paper (Discussion / Prior specification).

      More broadly, we employed the WPPM as a prior that imposed smoothness but little other structure, and used it to learn about the psychometric field. We are currently working to understand how best to use our current data to improve the prior we would apply to future measurements. There are a number of approaches to this. One would be to seek a parametric mechanistic model that can describe the current data and, to the extent this is possible, to formulate prior distributions over the parameters of that model. The results reported here thus provide a foundation for deriving and evaluating more structured priors that would leverage future datasets even more efficiently, albeit by imposing more structure. We have added this perspective to the Discussion / Extensions of the WPPM framework.

      (3) Quantitative comparison to historical data. The paper currently compares its results to MacAdam, Krauskopf & Karl, and Danilova & Mollon only by visual inspection. It is hard to extract and scale actual data from historical papers, but from the quality of the plotting here, it looks like the authors have achieved this, and so quantitative comparisons are possible. The MacAdam data comparisons are pretty interesting - in particular, the orientations of the long axes of the threshold ellipses do not really seem to line up between the two datasets - and I thought that the orientation of those ellipses was a critical feature of the MacAdam data. Quantitative comparisons (perhaps overall correlations, which should be immune to scaling issues, axis-ratio, orientation, or RMS differences) would give concrete measures of the quality of the model. I know the authors spend a lot of time comparing to the CIE data, and this is great.... But re-expressing the fitted thresholds in CIE or DKL coordinates, and comparing them directly with classical datasets, would make the paper's claims of "agreement" much more convincing.

      Although we are sympathetic to this request, we have chosen not to implement the sort of quantitative comparison requested by the reviewer. The reason is that an important feature of color thresholds is that they depend on the spatial (e.g. Kelly, 1974; Poirson & Wandell, 1996; Danilova & Mollon, 2025) and temporal (e.g. Kelly, 1974) properties of the stimuli, and on the observer’s state of adaptation (e.g. Loomis & Berger, 1979; Krauskopf & Gegenfurtner, 1992). Because (as the reviewer notes below) the spatial and temporal properties of our stimuli were not matched to those of the comparison datasets, our purpose in making these comparisons was to examine qualitative agreement, as well as to situate our results in the literature and to demonstrate that our approach allows us to read out thresholds around the references and in the color spaces used in other studies. We would not expect detailed quantitative agreement with the current dataset because of differences in stimuli.

      As a consequence of this, we think we would be overreaching to quantify the differences between our data and classic datasets. This consideration is particularly important for the MacAdam measurements, where because of the matching adjustment procedure used, the observer’s state of adaptation is likely to have varied (by amounts that are difficult to estimate) from one reference to the next (e.g. Danilova & Mollon, 2025). We have clarified the manuscript with respect to these points (Results / Comparison with previous measurements).

      An important and interesting future direction that emerges from our work is to develop efficient methods for characterizing the dependence of the full discrimination field on ancillary variables, such as those that describe spatial and temporal properties and/or the state of adaptation. We now mention this in the paper (Discussion / Implications for the mechanisms of color perception). Although this is not the primary motivation, doing so would enable comparison of data with a wider range of studies.

      We do agree that the comparisons to CIELAB predictions work better when we express them in CIELAB, and have now done so (Fig. 3D; Fig. S24-S26).

      Kelly, D. H. (1974). "Spatio-temporal frequency characteristics of color-vision mechanisms." Journal of the Optical Society of America 64(7): 983–990.

      Poirson, A. B. and B. A. Wandell (1996). "Pattern-color separable pathways predict sensitivity to simple colored patterns." Vision Research 36(4): 515–526.

      Danilova, M. V. and J. D. Mollon (2025). "Effect of stimulus size on chromatic discrimination." Journal of the Optical Society of America A 42(5).

      Loomis, J. M. and T. Berger (1979). "Effects of chromatic adaptation on color discrimination and color appearance." Vision Research 19(8): 891–901.

      Krauskopf, J. and K. Gegenfurtner (1992). "Color discrimination and adaptation." Vision Research 32(11): 2165–2175.

      Overall, this is a creative and technically sophisticated paper that will be of broad interest to vision scientists. It is probably already a definitive method paper showing how we can sample sensitivity accurately across colour space (and other visual stimulus spaces). But I think that until the comparison with historical datasets is made clear (and, for example, how the optimal smoothness parameters are estimated), it has slightly less to tell us about human colour vision. This might actually be fine - perhaps we just need the methods?

      Related to this, I'd also note that the authors chose a very non-standard stimulus to perform these measurements with (a rendered 3D 'Greebley' blob). This does have the advantage of some sort of ecological validity. But it has the significant disadvantage that it is unlike all the other (much simpler) stimuli that have been used in the past - and this is likely to be one of the reasons why the current (fitted) data do not seem to sit in very good agreement with historical measurements.

      As the reviewer notes, our stimuli head in the direction of ecological validity (see also Hedjar et al., 2025) and indeed this was a consideration when we chose them, at the cost of limiting the degree of comparison we can make with prior studies (as discussed above). Another reason we chose our stimuli is that they enable the current data to be used as a basis of comparison with stimuli where we add specularity, change object shape, and vary object pose in the future. These manipulations are not possible with flat matte patches. Such experiments are of interest to us, as they will tell us about how effectively color may be used to differentiate stimuli in cases where other ecologically important variables co-vary. We now mention this motivation in the paper (Results / Task and Stimuli).

      Hedjar, L., M. Toscani and K. R. Gegenfurtner (2025). "Importance of hue: color discrimination of three-dimensional objects and two-dimensional discs." Journal of the Optical Society of America A 42(5).

      Reviewer #2 (Public review):

      Summary:

      Hong et al. present a new method that uses a Wishart process to dramatically increase the efficiency of measuring visual sensitivity as a function of stimulus parameters for stimuli that vary in a multidimensional space. Importantly, they have validated their model against their own hold-out data and against 3 published datasets, as well as against colour spaces aimed at 'perceptual uniformity' by equating JNDs. Their model achieves high predictive success and could be usefully applied in colour vision science and psychophysics more generally, and to tackle analogous problems in neuroscience featuring smooth variation over coordinate spaces.

      Strengths:

      (1) This research makes a substantial contribution by providing a new method to very significantly increase the efficiency with which inferences about visual sensitivity can be drawn, so much so that it will open up new research avenues that were previously not feasible. Secondly, the methods are well thought out and unusually robust. The authors made a lot of effort to validate their model, but also to put their results in the context of existing results on colour discrimination, transforming their results to present them in the same colour spaces as used by previous authors to allow direct comparisons. Hold-out validation is a great way to test the model, and this has been done for an unusually large number of observers (by the standards of colour discrimination research). Thirdly, they make their code and materials freely available with the intention of supporting progress and innovation. These tools are likely to be widely used in vision science, and could of course be used to address analogous problems for other sensory modalities and beyond.

      Weaknesses:

      It would be nice to better understand what constraints the choice of basis functions puts on the space of possible solutions. More generally, could there be particular features of colour discrimination (e.g., rapid changes near the white point) that the model captures less well?

      This comment bears conceptual similarity to Reviewer 1’s question about the hyperparameters of our prior, as it is basically asking whether we might be oversmoothing through the choice of form and number of basis functions. The hyperparameter sweeps we now present suggest that within the choice of basis functions we used, we are operating at a reasonable point on the bias-variance tradeoff curve - we can see bias emerging with a smoother prior, and variance increasing with a less smooth prior. Our expectation is that varying the smoothness of the prior in other ways, such as by varying the form and number of the basis functions, would lead to similar tradeoffs.

      We did perform one additional check that shows, within our current framework, that adding more basis functions is unlikely to change things much. This was to plot the fit weights as a function of Chebyshev basis order (Figure S4 in Appendix 2). These decline to near zero at the highest order we used, suggesting that adding more would not alter the inferred psychometric field, given our hyperparameter choices. Although we could explore this question further by explicitly fitting the data using more basis functions along with different hyperparameter choices, or different functional forms for the basis functions, we decided not to pursue this in favor of performing the other additional analyses we now present.
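
      The logic of this check can be illustrated with a one-dimensional toy case. The sketch below is not the WPPM fit: it fits an arbitrary smooth function (chosen here purely for illustration) with a Chebyshev series and prints the magnitude of each weight by order. When the weights have decayed to near zero at the highest order used, adding further orders changes the fitted function very little.

```python
import numpy as np
from numpy.polynomial import chebyshev as C

# Smooth target on the model space [-1, 1] (hypothetical example function).
x = np.linspace(-1.0, 1.0, 201)
f = np.exp(x)

# Least-squares fit of a Chebyshev series up to order 6.
coeffs = C.chebfit(x, f, deg=6)
magnitudes = np.abs(coeffs)

# Inspect how the weight magnitude falls off with basis order.
for order, m in enumerate(magnitudes):
    print(f"order {order}: |weight| = {m:.2e}")
```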

      We share the reviewer’s concern that assuming smoothness, both by assuming that isoperformance contours are elliptical and by assuming that these vary smoothly with reference, might cause us to miss features of the true underlying field in cases where that field varies rapidly or the isoperformance contours are asymmetric or non-elliptical. Our approach to this was to measure the validation thresholds and demonstrate that any bias in our WPPM-inferred field is small for these measurements. Because we shared the reviewer’s intuition that the adapting point is a candidate location where there might be less smooth variation, we measured a validation threshold at this reference for every subject. Nonetheless, we only measured in one direction around the adapting reference for each subject. We considered validation approaches where we measured full ellipses at a set of validation references, but we were worried about effects of uncertainty reduction and perceptual learning that might distort thresholds at highly sampled locations.

      It is the case that if one wanted to study the discrimination field in more detail around a particular reference, one could concentrate trials in a smaller model space around that reference, and for the same number of trials use a prior with less smoothness relative to the underlying stimulus space. Indeed, simply halving the size of the stimulus space that maps onto the [-1,1] model space and keeping the same prior over the model space effectively halves the degree of smoothness expressed with respect to the stimulus space. Thus our methods could prove useful in studying more rapid variations in the discrimination field if one hypothesized that they might occur around particular reference choices, but this would still rest upon the elliptical assumption. To relax that assumption, one could use the threshold field estimation methods implemented in AEPsych, which incorporate a smoothness assumption but do not assume elliptical isoperformance contours. Weakening the prior in this way would, however, increase trial demand to obtain similar measurement precision.
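
      The scaling argument can be written out explicitly. The notation here is introduced only for illustration and is not taken from the paper: $s$ is a stimulus coordinate, $c$ the center of the sampled region, $h$ its half-width, $u$ the model-space coordinate, and $\ell_u$ an informal characteristic length over which the prior permits the field to vary in model space.

```latex
\begin{align}
  u &= \frac{s - c}{h}, \qquad u \in [-1, 1], \\
  \ell_s &= h\,\ell_u
    && \text{(the same length expressed in stimulus units)}, \\
  h \;\to\; \tfrac{h}{2}
    &\;\Longrightarrow\; \ell_s \;\to\; \tfrac{1}{2}\,\ell_s
    && \text{(prior over } u \text{ unchanged)}.
\end{align}
```

      With the prior over $u$ held fixed, halving $h$ lets the fitted field vary over stimulus distances half as large, which is the sense in which smoothness with respect to the stimulus space is halved.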

      As a general matter, we don’t think it is possible to leverage smoothness for trial efficiency on the one hand and at the same time be completely sure that there isn’t some aspect to the underlying ground truth that has been smoothed over. Carefully choosing the degree of prior smoothness together with the number of experimental trials in the context of a particular content problem is an important part of bringing the WPPM and related methods to bear, and one where simulation and held-out data both play an important role.

      We now bring these points out more fully in the paper (Discussion / Extensions of the WPPM framework; Discussion / Prior specification).

      Chen, C.-C., J. M. Foley and D. H. Brainard (2000). "Detection of chromoluminance patterns on chromoluminance pedestals I: threshold measurements." Vision Research 40(7): 773–788.

      The substantial individual differences evident in Figure S20 (comparison with Krauskopf and Gegenfurtner, 1992) are interesting in this context. Some observers show radial biases for the discrimination ellipses away from the white point, some show biases along the negative diagonal (with major axes oriented parallel to the blue-yellow axis), and others show a mixture of the two biases. Are these genuine individual differences, or could the model be performing less accurately in this desaturated region of colour space?

      We agree that these differences are interesting. We have now added more complete bootstrapped confidence regions to these figures (Appendix 8) and to the other comparison figures (Appendices 6, 7, and 9), so that an estimate of measurement precision is directly available in them. These confidence regions suggest that the individual differences in this region of color space are real. A longer-term goal is to develop more mechanistic models that can account for individual subject data through parameter choice. This might lead to insight into what differs in the visual system across individuals.

      Reviewer #3 (Public review):

      Summary:

      This study presents a powerful and rigorous approach for characterizing stimulus discriminability throughout a sensory manifold, and is applied to the specific context of predicting color discrimination thresholds across the chromatic plane.

      Strengths:

      Color discrimination has played a fundamental role in studies of human color vision and for color applications, but as the authors note, it remains poorly characterized. The study leverages the assumption that thresholds should vary smoothly and systematically within the space, and validates this with their own tests and comparisons with previous studies.

      Weaknesses:

      The paper assumes that threshold variations are due to changes in the level of intrinsic noise at different stimulus levels. However, it's not clear to me why they could not also be explained by nonlinearities in the responses, with fixed noise. Indeed, most accounts of contrast coding (which the study is at least in part measuring because the presentation kept the adapt point close to the gray background chromaticity, and thus measured increment thresholds), assume a nonlinear contrast response function, which can at least as easily explain why the thresholds were higher for colors farther from the gray point. It would be very helpful if a section could be added that explains why noise differences rather than signal differences are assumed and how these could be distinguished. If they cannot, then it would be better to allow for both and refer to the variation in terms of S/N rather than N alone.

      We agree with the reviewer. We are measuring SNR and attributing it to noise, but cannot identify from the data whether changes in SNR across color spaces are due to changes in noise, to a nonlinear relationship between stimulus space and the observer’s response space with noise in the response space held fixed, or both. We now make this point where we introduce the Results / Wishart Process Psychophysical Model and reiterate it in the Discussion / Extensions of the WPPM framework.
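
      A one-dimensional signal-detection sketch makes the non-identifiability concrete. This is a textbook-style illustration under assumptions introduced here, not the WPPM formulation: $f$ is a hypothetical response function, $\sigma(s)$ the standard deviation of additive Gaussian noise in the response, and $d'_c$ the criterion sensitivity that defines threshold.

```latex
\begin{align}
  r &= f(s) + n, \qquad n \sim \mathcal{N}\!\bigl(0, \sigma^2(s)\bigr), \\
  d'(s, \Delta) &\approx \frac{f'(s)\,\Delta}{\sigma(s)}
    \qquad \text{for small } \Delta, \\
  \Delta_{\mathrm{thr}}(s) &= d'_c \, \frac{\sigma(s)}{f'(s)} .
\end{align}
```

      Thresholds depend only on the ratio $\sigma(s)/f'(s)$. A linear response with stimulus-dependent noise ($f'(s) = 1$, $\sigma(s) = g(s)$) and a nonlinear response with fixed noise ($f'(s) = 1/g(s)$, $\sigma(s) = 1$) therefore predict identical thresholds, so threshold data alone cannot separate the two accounts.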

      Related to this point, the authors note that the thresholds should depend on a number of additional factors, including the spatial and temporal properties and the state of adaptation. However, many of these again seem to be more likely to affect the signal than the noise.

      We don’t disagree. Indeed, as we noted in our response to a comment by Reviewer 1 and above in the context of individual differences, we are very interested in developing a mechanistically plausible model that accounts for the data. If we or others are able to do so, that would provide a basis for parsing performance into separate signal and noise effects. And if such a model has natural ways in which additional variables affect its predictions, measuring the effects of these variables would be a way to provide evidence in favor of the model (Discussion / Implications for the mechanisms of color perception; Discussion / Extensions of the WPPM framework).

      An advantage of the approach is that it makes no assumptions about the underlying mechanisms. However, the choice to sample only within the equiluminant plane is itself a mechanistic assumption, and these could potentially be leveraged for deciding how to sample to improve the characterization and efficiency. For example, given what we know about early color coding, would it be more (or less) efficient to select samples based on a DKL space, etc?

      The more we are willing to assume about the structure of the psychometric field, the more efficiently we can measure it. As the reviewer correctly notes, this principle applies to trial placement as well. We are currently using an adaptive method (AEPsych) that starts with a fairly weak smoothness prior and attempts to place trials using heuristics that aim to minimize the expected uncertainty in the posterior. As we learn more about the discrimination field, we should be able to leverage stronger priors to increase trial efficiency. This point is closely related to one we made above about developing stronger priors that capture what we have learned in this study. Such priors could also help improve trial placement. For a prior that has a relatively small number of parameters, for example, perhaps a mechanistic prior, methods such as Quest+ (Watson, 2017) may be used for trial placement.

      Watson, A. B. (2017). "QUEST+: A general multidimensional Bayesian adaptive psychometric method." J Vis 17(3): 10.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      I do not think that the authors need to perform additional experiments. However, I would like to see some additional analyses regarding the assumptions made in the fitting procedure and how they affect the final maps.

      I also think some more quantitative comparisons with historical data would be valuable - at the moment, a lot of the comparisons are simply 'by eye'.

      It would have been nice to have the code and data available during the review procedure - I'm sure these will be released with excellent documentation?

      We addressed the first two points in the public review section. The code and data are now available online, and the links are provided in the paper (Methods and Materials / Data and code availability).

      Reviewer #2 (Recommendations for the authors):

      Minor points

      I have a few suggestions for additions and small changes.

      (1) Several examples of covariance matrix fields are shown in Figures 1 and 4, but these are for simulated examples. It would be nice to see the fields actually fit to the data! I would be interested in seeing this for all participants in an Appendix, and maybe for participant CH in the main paper?

      We have made the changes (see Figure 4 and Figure S3).

      (2) I have not worked through all the math in the appendices line by line, but it seems to be complete, and the model validation results speak for themselves. I think the authors have done a pretty good job of explaining the model conceptually (not easy), but I struggled with the 'weighted sum' step in Figure 4 and the main text. I would appreciate a bit more hand-holding here, e.g, why is an 'overcomplete' representation needed as an intermediate, and providing an intuition of why there are 12 matrices in the overcomplete representation and what each matrix in this representation represents.

      We have now added more explanations in the figure legend and text (Fig. 4 and Methods and Materials / The Wishart Process Psychometric Model).

      (3) Individual differences: There is a section on this in the manuscript, and it's concluded that there are only "modest" individual differences. However, in Figure S20, the individual differences, I think, are huge and place observers almost in qualitatively different categories! Some observers show a radial bias in discrimination ellipses, others seem to show basically a bias along the negative diagonal, and others a mixture of both biases. These ellipses are at a desaturated part of colour space - is it possible that there are some rapid changes in the underlying noise in this region that the Wishart fit has not captured due to relatively sparse sampling or the fact that the basis functions are all fairly low spatial frequency? I wondered whether the results are constrained by the choice of Cartesian rather than polar basis functions, e.g, polar basis functions may have better allowed fine-grained changes near the white point but slower changes at higher saturations away from the white point.

      We agree that the individual differences are meaningful and, in some cases, quite pronounced. Our intent in describing the differences as “modest” was to emphasize that the overall structure of the psychometric fields remains broadly consistent across observers. We have revised the Results to note and more fully describe these differences.

      Regarding the possibility that sharp changes in the underlying noise near the achromatic point might not be fully captured by the current model, we agree that this is an important consideration. The current implementation uses relatively low-order Chebyshev basis functions that primarily capture smooth global variations in the psychometric field. While validation analyses indicate that these basis functions capture the dominant structure in the data, they may be less sensitive to sharp local variations such as those that could occur near the white point. Future work could address this by mapping the model space to a smaller region around the achromatic reference or by exploring alternative basis sets (e.g., polar or Zernike functions) that may better capture such localized structure. This is discussed above in this response and now addressed in Discussion / Extensions of the WPPM framework.

      On sampling, I wondered if the results might have been biased by the strongly biased ellipse that occurs at the grey point. If not, and the model is accurate in this region of colour space, I think this figure does show some large individual differences, and it would be good to comment on these in the individual differences section of the manuscript.

      Based on our analysis of trial placement (Fig. S1), the adaptive algorithm does not appear to have disproportionately concentrated trials near the gray point. In fact, more trials were allocated to the edges of the stimulus space than to the center. This suggests that the WPPM estimates are unlikely to be driven primarily by performance in the gray region. In addition, we examined the threshold ellipses around the gray reference in DKL space and found that they are broadly consistent across participants (Figs. S22–S23). Together, these analyses suggest that the anisotropy observed near the gray point reflects a genuine property of the psychometric field rather than an artifact of the sampling procedure.

      As noted just above, we have added additional text about individual differences in the Results and referenced it in the Discussion.

      (4) The manuscript seems unusually free of typographical errors, but I noticed that in many places "Krauskopf and Karl 1992" is cited! Also, I think something has gone wrong with the legend to Figure 2 - perhaps the order of panels was swapped around, but the legend was not fully updated. There is a repeated reference to the "summary of regression slopes" which seems to be in 2 positions, after C and G. It would make more sense to label panel G as D and progress from there, or switch the order of the panels so that G is on the bottom row.

      Thank you for catching those errors. They are now fixed.

      Reviewer #3 (Recommendations for the authors):

      A minor point (or perhaps major if your last name is Gegenfurtner) is that the reference to Krauskopf and Karl is incorrect.

      This is now fixed.

    1. eLife Assessment

      This study addresses the mechanism of action of benzoylurea insecticides and explores the metabolic consequences of inhibiting glycogen breakdown in insects. Both reviewers identify major flaws with the premise of the work. The strength of the provided evidence is inadequate as the data do not, or poorly, support several central claims. The significance of the findings is considered marginal.

    2. Reviewer #1 (Public review):

      Summary:

      In this study, the authors investigate whether glycogen phosphorylase is a potential molecular target of benzoylphenylurea insecticides and examine the physiological consequences of inhibiting glycogen breakdown in the diamondback moth Plutella xylostella. The authors express and characterize recombinant glycogen phosphorylase, test its inhibition by a mammalian glycogen phosphorylase inhibitor and by the insecticide diflubenzuron, and assess the physiological effects of glycogen phosphorylase inhibition through chemical exposure and RNA interference. Based on these experiments, the authors conclude that benzoylphenylurea insecticides do not target glycogen phosphorylase and propose that insects compensate for glycogen phosphorylase inhibition through activation of gluconeogenesis, allowing them to maintain glucose homeostasis and complete development despite strong suppression of the enzyme.

      Strengths:

      The study addresses an interesting and long-standing question in insect toxicology regarding the mechanism of action of benzoylphenylurea insecticides. The authors combine several complementary approaches, including recombinant enzyme characterization, inhibitor assays, RNA interference, gene expression analyses, and metabolite measurements. The biochemical characterization of the recombinant glycogen phosphorylase and the demonstration that the tested glycogen phosphorylase inhibitor can strongly inhibit enzyme activity represent important technical strengths. In addition, the study integrates biochemical and physiological observations to explore how insects might compensate for disruptions in central carbohydrate metabolism.

      Weaknesses:

      Several aspects of the central conclusions rely on indirect evidence and would benefit from additional validation. The proposed compensatory mechanism (gluconeogenesis supported by amino acid mobilization) is inferred primarily from transcriptional changes in gluconeogenic genes, reduced protein levels, and changes in metabolite concentrations. While these observations are consistent with increased gluconeogenic activity, they do not directly demonstrate metabolic flux through this pathway. Direct measurements of gluconeogenic flux would be required to confirm that carbon derived from non-carbohydrate substrates contributes to glucose production.

      Some interpretations are also speculative. For example, the lack of glycogen accumulation following glycogen phosphorylase knockdown is attributed to alternative glycogen degradation pathways, such as α-amylase or glycogen debranching enzymes, but these possibilities are not experimentally examined. Measuring the expression or activity of these enzymes would help evaluate whether such pathways contribute to the observed metabolic response.

      The physiological consequences of the proposed metabolic compensation are also not fully explored. If proteins are mobilized to support gluconeogenesis, this shift might be expected to affect organismal traits such as adult body size, flight capacity, or reproductive performance. Assessing these traits could provide valuable insight into whether the proposed compensatory metabolism carries fitness costs.

      Finally, some conclusions extend beyond the direct evidence presented. The study shows that diflubenzuron does not inhibit glycogen phosphorylase in vitro, but broader conclusions regarding the mechanism of action of benzoylphenylurea insecticides as a class may require additional evidence. In addition, some biochemical and cell-based observations would benefit from confirmation in whole insects, given that metabolic regulation can differ substantially between isolated enzyme or cell-based systems and intact larvae, where hormonal signaling, tissue interactions, and nutrient availability influence metabolic responses.

    3. Reviewer #2 (Public review):

      (1) Significance of the findings and strength of the evidence

      This manuscript evaluates the hypothesis that benzoylurea (BPU) insecticides exert their effects through inhibition of glycogen phosphorylase rather than chitin synthase (CHS). The central premise, that structural similarity among acylurea compounds implies shared molecular targets, is not supported by existing evidence.

      Extensive genetic and biochemical studies, including Reference 5, demonstrate that chitin synthase is the primary insecticidal target of BPUs. In particular, amino acid substitutions at a single site in CHS confer high levels of resistance to diflubenzuron and related compounds, with causality established through CRISPR/Cas9 editing in Drosophila melanogaster. This body of evidence substantially weakens the rationale for proposing glycogen phosphorylase as an alternative primary target.

      The manuscript reports that an acylurea compound previously identified as an inhibitor of mammalian glycogen phosphorylase also inhibits glycogen phosphorylase from Plutella xylostella, while diflubenzuron does not. This observation is consistent with prior work showing that glycogen phosphorylase inhibition among acylureas depends on specific side chain substitutions rather than the shared acylurea core. Consequently, the finding does not support the broader inference that acylurea structure predicts common biological function.

      The manuscript further argues that inhibition of glycogen phosphorylase is not insecticidal and attributes this to metabolic compensation through alternative glucose-producing pathways. While it is well established that eukaryotic cells possess multiple mechanisms for maintaining glucose availability, the evidence provided here does not fully support the broader claim that this mechanism explains the lack of insecticidal activity. In particular, the conclusion that the study "resolves" the primary hypothesis is not justified by the data presented.

      Overall, while some experimental observations are sound in isolation, the overarching conclusions are not supported by the strength of the evidence. The significance of the findings is therefore limited.

      (2) Interpretation in the context of existing literature

      The introduction states that the molecular target of BPU insecticides remains a major unresolved controversy. However, multiple prior studies, including References 1, 4, and 5, provide strong genetic evidence that CHS is the primary and essential target of BPUs. These results demonstrate causality rather than simple correlation, particularly through targeted gene editing approaches.

      The manuscript further claims that biochemical studies have failed to demonstrate CHS inhibition by BPUs in cell-free assays. However, the cited references (6-9) did not express CHS in such assays and therefore do not directly address this question. As a result, the suggested discrepancy between genetic and enzymatic evidence is not well founded.

      Structural analysis of acylurea compounds indicates that biological activity depends on side chain composition rather than the conserved acylurea core. Prior screening studies (Reference 11) show substantial variability in glycogen phosphorylase inhibition among acylureas despite a shared core structure. This undermines the proposal that the acylurea moiety itself constitutes a meaningful clue to a shared molecular mechanism.

      Regarding implications for pesticide design, targeting chitin synthesis remains an attractive strategy because chitin is essential for arthropods and absent in mammals, providing both efficacy and specificity. By contrast, metabolic enzymes such as glycogen phosphorylase are widely conserved, making them less suitable targets from a toxicological and safety perspective.

      (3) Specific technical comments

      The manuscript uses the term "dataology," which is neither defined nor contextualized within the text. As currently used, the term appears unrelated to the subject matter and may be confusing to readers. Clarification or removal would improve clarity.

    4. Author response:

      Public Reviews:

      Reviewer #1 (Public review):

      (1) The proposed compensatory mechanism is inferred primarily from transcriptional changes and metabolite levels; direct measurements of gluconeogenic flux are lacking.

      We agree that isotopic tracer experiments would provide the most direct evidence for gluconeogenic flux. While such experiments are beyond the scope of the current revision, we will explicitly acknowledge this as a key limitation and clearly state it as an important direction for future research. We note, however, that convergent evidence from multiple independent lines (transcriptional upregulation of PEPCK and G-6-Pase, declining protein levels, altered amino acid profiles, and maintained trehalose levels) collectively supports gluconeogenic activation, even though each individual line is indirect. In the revised manuscript, we will present this evidence more cautiously, framing it as “consistent with gluconeogenic compensation” rather than as definitively establishing metabolic flux.

      (2) Alternative glycogen degradation pathways (α-amylase, glycogen debranching enzymes) are proposed but not experimentally examined.

      We have now directly addressed this by measuring, via RT-qPCR, the expression of glycogen branching enzyme (GBE) and α-amylase following PxGP knockdown. Our preliminary results reveal a striking and informative pattern:

      GBE was significantly upregulated at 24 h (+29.24%), 48 h (+16.78%), and 96 h (+44.46%) post-injection, indicating transcriptional activation of an alternative glycogen-metabolizing enzyme in response to GP suppression.

      α-Amylase showed no significant change at any time point, suggesting that the compensatory response is pathway-specific rather than a generalized upregulation of all starch/glycogen-degrading enzymes.

      This differential response (GBE upregulated, α-amylase unchanged) provides the first direct evidence that P. xylostella selectively activates specific alternative glycogen catabolic pathways when GP function is compromised. These data will be incorporated into the revised manuscript as a new figure panel.

      (3) Physiological consequences of the proposed metabolic compensation (fitness costs) are not explored.

      We have now assessed fitness consequences of PxGP knockdown by measuring feeding rate, larval body weight, and pupal weight. The results reveal a transient but significant fitness cost:

      Feeding rate: no significant difference between dsGP and dsGFP groups across all time points (24–120 h), indicating that the observed metabolic changes are not attributable to reduced food intake.

      Larval weight: significantly reduced at 24 h (−29.10%) and 48 h (−25.38%) in the dsGP group, demonstrating that metabolic compensation carries a measurable short-term cost.

      Pupal weight: no significant difference, indicating that larvae recover from the transient weight deficit before pupation.

      This pattern, transient larval weight loss with full pupal recovery, is consistent with our proposed model: GP suppression triggers protein catabolism to fuel gluconeogenesis (explaining the weight loss), but the compensatory mechanism is sufficiently effective to restore metabolic homeostasis before the pupal transition. Adult wing area and female fecundity measurements are currently in progress and will be included in the revised manuscript.

      (4) Enzyme activity is not measured in RNAi-treated insects; only transcript-level knockdown is reported.

      We have now measured GP enzyme activity (GPa) in crude extracts from RNAi-treated larvae using the coupled-enzyme spectrophotometric assay. The results provide important new insights:

      Per-larva GP activity was significantly reduced at 24 h (−27.57%) and 48 h (−29.28%), confirming that RNAi-mediated transcript suppression translates to reduced enzyme function in vivo.

      Per-protein GP activity showed a significant reduction only at 48 h (−10.35%). This apparent discrepancy is explained by a substantial decrease in total protein concentration at 24 h (−44.48%), which then gradually recovered. When enzyme activity is normalized to a declining protein pool, the per-protein reduction appears smaller.

      Importantly, the 44.48% decline in total protein at 24 h provides independent biochemical confirmation of our proposed protein catabolism: it is consistent with the mobilization of protein stores to supply amino acids for gluconeogenesis, directly supporting the compensatory mechanism described in our manuscript.

      These enzyme activity data will be presented alongside the existing transcript-level data in the revised manuscript, providing a complete picture from gene expression through enzyme function.

      (5) Conclusions regarding BPU class may require testing additional compounds beyond diflubenzuron.

      We agree and will explicitly limit our conclusion to diflubenzuron in the revised manuscript. The relevant text will be revised to state that “DFB does not inhibit PxGP” rather than making broader claims about the BPU class as a whole.

      (6) Structural evidence that GPI can bind PxGP in a comparable manner to its mammalian target is lacking.

      We have performed molecular docking and binding free energy analysis to address this concern directly. The PxGP homodimer structure was modeled using SWISS-MODEL with the rabbit muscle GP–acyl urea co-crystal structure (PDB: 2ATI; Klabunde et al., 2005) as the template. Molecular docking and MM/GBSA calculations were performed using Cresset Flare V11.

      Key findings:

      GPI exhibited substantially stronger binding to PxGP (ΔG = −34.63 kcal/mol) compared to DFB (ΔG = −29.29 kcal/mol), with a ΔΔG of −5.34 kcal/mol.

      Energy decomposition revealed that van der Waals interactions are the primary driver of selectivity (ΔG_VDW = −11.49 kcal/mol), reflecting superior shape complementarity of GPI within the binding pocket.

      GPI was predicted to bind at the allosteric site at the dimer interface, engaging seven residues across both subunits (Asn44 and Val45 from chain A; Trp67, Gln71, Tyr75, Arg193, and Asp227 from chain B), a binding mode consistent with the experimentally determined site of acyl urea inhibitors in mammalian GP.

      DFB contacted only six residues, primarily from a single subunit, and its difluorobenzoyl moiety remained entirely solvent-exposed without productive protein contacts, explaining its inability to achieve effective target engagement.

      These structural data, together with the biochemical inhibition data (IC50 = 2.96 nM for GPI; no inhibition by DFB), provide a comprehensive molecular explanation for the observed selectivity. The results will be presented as a new figure and table in the revised manuscript.
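      As a quick arithmetic check on the figures quoted above, the ΔΔG is simply the difference between the two reported binding free energies. The sketch below is illustrative only: the variable and function names are hypothetical and are not taken from the docking software's output.

```python
# Reported MM/GBSA binding free energies (kcal/mol), as quoted in the response.
DG_GPI = -34.63  # GPI bound to PxGP
DG_DFB = -29.29  # DFB bound to PxGP


def delta_delta_g(dg_ligand: float, dg_reference: float) -> float:
    """Relative binding free energy; a negative value means the ligand
    is predicted to bind more strongly than the reference."""
    return round(dg_ligand - dg_reference, 2)


ddg = delta_delta_g(DG_GPI, DG_DFB)
print(f"ddG (GPI - DFB) = {ddg:.2f} kcal/mol")
```

      The negative sign indicates that GPI is predicted to bind more favorably than DFB, in line with the ΔΔG stated in the key findings.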

      (7) Dietary carbohydrates could mask the metabolic effects of GP inhibition.

      Our new data showing no difference in feeding rate between dsGP and dsGFP groups address this concern from one angle: the metabolic changes we observe are not attributable to altered food intake. We will also add a discussion of the potential contribution of dietary carbohydrates to glucose homeostasis and acknowledge this as a caveat in interpreting the metabolite data.

      Minor points: All terminology errors (“gluconeogenolysis” → “gluconeogenesis”), typographical errors (“over over four decades”), and formatting inconsistencies will be corrected. We will clarify the metabolite normalization approach and improve figure labeling and pathway schematics.

      Reviewer #2 (Public review):

      (1) The central premise — that structural similarity among acylurea compounds implies shared molecular targets — is not supported by existing evidence.

      We agree that the original manuscript overstated the significance of the shared acylurea core as a predictor of common biological activity. In the revised manuscript, we will substantially restructure the Introduction to:

      (1) Explicitly acknowledge the compelling genetic evidence from CRISPR/Cas9 experiments (Reference 5) establishing CHS as the primary site conferring BPU resistance.

      (2) Reframe the study’s objective: rather than proposing to “resolve” the BPU target controversy, the revised manuscript will focus on the systematic evaluation of GP as an independent insecticidal target and the discovery of a gluconeogenic compensation mechanism, questions that have scientific value independent of the BPU mechanism debate.

      (3) Remove the claim that the study “resolves the primary hypothesis.” The conclusion will instead state that our biochemical data demonstrate DFB does not inhibit PxGP, adding enzyme-level evidence to the existing genetic framework.

      (2) Target selectivity among acylurea compounds is determined by side-chain composition, not the shared core.

      We fully agree, and our new structural data now provide a molecular explanation for this principle at the atomic level. Molecular docking reveals that both GPI and DFB anchor to PxGP through their common acylurea carbonyl groups (forming hydrogen bonds with Arg193), but diverge dramatically in their side-chain engagement: GPI’s methoxyphenyl-methylurea moiety engages five additional residues across the dimer interface, while DFB’s difluorobenzoyl group remains entirely solvent-exposed. The van der Waals energy difference (ΔΔG_VDW = −11.49 kcal/mol) quantitatively reflects this differential shape complementarity. These data directly support Reviewer 2’s point and will be presented as new evidence in the revised manuscript.

      (3) References 6–9 did not express CHS in cell-free assays.

      We will revise the relevant passage for greater precision. Our revised text will distinguish between (a) the absence of direct biochemical evidence for BPU-mediated CHS inhibition in cell-free systems and (b) the technical challenge of expressing and purifying functional CHS for such assays. This distinction will be stated more carefully to avoid any mischaracterization of the cited literature.

      (4) The term “dataology” is non-standard.

      This term has been removed and replaced with “data.” In accordance with eLife’s policy on the use of AI tools and technology, we will include a statement in the Materials and Methods section declaring that AI-based language editing tools were used for English grammar and style refinement. All scientific content was generated entirely by the authors.

      Author response table 1.

      We are confident that the substantial new experimental data and restructured narrative will meaningfully strengthen the manuscript.

    1. eLife Assessment

      The study provides valuable findings suggesting that modifying the donor's diet improves the effectiveness of fecal transplant therapies for liver disease. Although the reported results are of value, the evidence supporting the overall conclusions is incomplete. In particular, causal inferences regarding the effects of microbiota composition, as well as caproic acid signaling on the phenotypes studied, need further confirmation.

    2. Reviewer #1 (Public review):

      Summary:

      The authors aimed to determine whether dietary conditioning of fecal microbiota donors can influence the therapeutic efficacy of fecal microbiota transplantation (FMT) in alcohol-associated liver disease (ALD). Specifically, they tested whether donor diets enriched in vegetable or egg-derived proteins alter microbiota composition and function in ways that enhance recovery from alcohol-induced liver injury. Using a murine ALD model, the study integrates microbiome profiling, metabolomics, proteomics, and functional assays to identify mechanisms underlying improved outcomes. The authors propose that vegetable protein-conditioned microbiota promote beneficial microbial remodeling and increased production of caproic acid, which in turn activates hepatic PPARα signaling and enhances fatty acid β-oxidation, thereby reducing steatosis and inflammation.

      Strengths:

      The study is ambitious and methodologically comprehensive. The central idea, that donor diet can modulate FMT efficacy in ALD, is compelling and potentially impactful. It combines in vivo disease models, microbiome analysis (16S rRNA sequencing), metabolomics and proteomics, pharmacological inhibition experiments, and in vitro validation in hepatocytes. This multi-layered approach is a clear strength and allows the authors to explore the gut-liver axis. The comparison between different protein sources (vegetable vs egg) is very interesting, and the PPARα inhibition experiments provide relatively strong functional support for the involvement of host metabolic signaling pathways in mediating the observed effects.

      Weaknesses:

      Despite the comprehensive scope of the manuscript, several aspects of the study limit the strength of its mechanistic conclusions. The causal attribution to caproic acid remains incomplete. While caproic acid is identified and functionally tested, there is no direct demonstration that it is necessary for the Veg-FMT phenotype in vivo. The metabolomics data suggest multiple candidate metabolites, but these are not systematically explored. The study identifies specific bacterial taxa and, separately, key metabolites, but does not establish a direct connection between microbial composition and metabolite production. The use of GW6471 supports involvement of PPARα but does not fully establish specificity, as off-target effects cannot be excluded. Finally, it is not fully clear whether effects are exclusively microbiota-driven or could partially reflect the transfer of diet-derived metabolites.

      The authors successfully demonstrate that donor dietary conditioning influences the therapeutic efficacy of FMT in a murine model of ALD. The data convincingly show that vegetable protein-conditioned microbiota is associated with improved liver injury, reduced inflammation, and enhanced intestinal barrier integrity compared with controls or an egg protein-enriched diet. While the proteomic and gene expression data suggest activation of pathways related to fatty acid β-oxidation, these measurements do not directly demonstrate increased metabolic flux. The use of the PPARα antagonist GW6471 provides important functional support for the involvement of this pathway, as inhibition attenuates the protective effects of Veg-FMT. However, this approach primarily establishes pathway dependency rather than directly confirming enhanced β-oxidation activity. The authors may therefore wish to moderate their interpretation or clarify this distinction, particularly given the relatively modest fold changes observed in several targets. The role of caproic acid as a central mediator is plausible but not definitively established. Finally, the link between microbiota composition, metabolic function, and host signaling remains partly correlative. Overall, the study achieves its primary aim at a phenotypic level, but some of the mechanistic claims would benefit from more cautious interpretation or additional validation.

      Likely impact of the work on the field, and the utility of the methods and data to the community:

      The work addresses an important and underexplored question: how donor characteristics influence FMT efficacy. By introducing donor diet as a modifiable variable, the study has potential implications for optimizing microbiota-based therapies. The datasets (microbiome, metabolomics, and proteomics) may also be valuable to the community, as they provide a resource for exploring gut-liver metabolic interactions. The translational impact will, however, depend on validation in human systems and a clearer identification of causal mechanisms.

    3. Reviewer #2 (Public review):

      The manuscript explores a valuable strategy for optimizing Fecal Microbiota Transplantation (FMT) efficacy in alcoholic liver disease through donor dietary intervention. I have identified several critical logical gaps, missing links in the evidence chain, and methodological ambiguities that require detailed explanation and supplementation.

      (1) While the Methods section states that each recipient mouse group consisted of 16 animals, microbiome sequencing was performed on only 4 samples per group. This sample size is insufficient, and the high inter-individual variability observed reduces the statistical power and representativeness of the data. I recommend increasing the sequencing sample size or, at a minimum, explicitly acknowledging the risk of false positives due to the small sample size in the Discussion.

      (2) The layout of Figure 4 should be adjusted. Panel A should be enlarged for better visibility, while Panel B should be reduced in size to balance the figure composition.

      (3) A rationale should be provided for the selection of egg white protein as the animal protein control. Does this adequately represent animal proteins in general? Could the results differ if casein or whey protein were used? The current choice limits the generalizability of the conclusions, and this limitation should be addressed.

      (4) The ALD model was established over 12 weeks, yet the FMT intervention consisted of only 3 administrations with a 1-week observation period. In the context of such a severe liver injury model, a 1-week recovery period appears insufficient to observe genuine fibrosis reversal, which typically requires a longer timeframe. The authors should discuss whether short-term FMT can truly induce structural remodeling or if the observed effects are transient.

      (5) The results rely heavily on PICRUSt2 for functional prediction. As prediction does not equate to factual validation, the authors should exercise caution in their wording within the Discussion. Alternatively, I recommend supplementing the study with shotgun metagenomic sequencing to verify the existence of these pathways rather than relying solely on predictive algorithms.

      (6) Although Egg-FMT was less effective than Veg-FMT, it performed better than the standard FMT or abstinence groups. Why is the effect of egg white protein intermediate? Is this due to rapid digestion resulting in insufficient substrate, or differences in metabolite production? A deeper comparative analysis of the Egg-FMT group is required, rather than treating it merely as a negative control.

      (7) Relying solely on the "inhibitor blocking effect" proves only that caproic acid's function is dependent on the PPARα pathway, not that it directly acts on PPARα. To claim direct activation, the authors must demonstrate direct binding between caproic acid and the PPARα protein (e.g., via SPR or MST assays). Alternatively, a luciferase reporter assay driven specifically by PPARα response elements (PPRE) should be conducted. If caproic acid induces luminescence, it would confirm transcriptional activation of PPARα rather than mere downstream activation.

    4. Author response:

      We thank the Reviewing Editor, Senior Editor, and both reviewers for their constructive evaluation of our manuscript. We are encouraged that the reviewers found the central question, whether donor dietary conditioning modulates FMT efficacy in ALD, compelling and the multi-omics framework a strength. Their critiques converge on a shared theme: the manuscript's mechanistic claims around caproic acid and PPARα signaling currently rest on associative and pathway-level evidence, and would benefit from more direct causal testing and more guarded language. We agree, and we outline below the revisions we plan to undertake.

      Public Reviews:

      Reviewer #1 (Public review):

      While the proteomic and gene expression data suggest activation of pathways related to fatty acid β-oxidation, these measurements do not directly demonstrate increased metabolic flux. The use of the PPARα antagonist GW6471 provides important functional support for the involvement of this pathway; however, this approach primarily establishes pathway dependency rather than directly confirming enhanced β-oxidation activity. The role of caproic acid as a central mediator is plausible but not definitively established. Finally, the link between microbiota composition, metabolic function, and host signaling remains partly correlative.

      We thank the reviewer for this thoughtful assessment. We agree that the GW6471 inhibition experiments primarily support pathway dependency rather than direct activation of PPARα by caproic acid, and we will revise the manuscript accordingly to avoid overstating mechanistic conclusions. We would also like to clarify that the objective of the current study was not to directly quantify metabolic flux. We agree that the term "metabolic flux" should not be used here, and we will revise the text to make clear that we measured mitochondrial β-oxidation as a response to caproic acid.

      To functionally assess alterations in fatty acid β-oxidation capacity, we performed Seahorse Mito Fuel Flex assays, which demonstrated altered dependency and utilization of fatty acid oxidation pathways in response to caproic acid treatment. We will further clarify this distinction in the revised manuscript.

      In addition, we agree that the role of caproic acid as a central mediator and the relationship between microbiota composition, metabolite production, and host signaling remain partly correlative. Therefore, we will moderate the interpretation throughout the manuscript and incorporate additional correlation analyses between microbial taxa, caproic acid levels, and disease-associated metabolic parameters to strengthen the microbiota-metabolite-host association while acknowledging the associative nature of these findings.

      Reviewer #2 (Public review):

      (1) While the Methods section states that each recipient mouse group consisted of 16 animals, microbiome sequencing was performed on only 4 samples per group. This sample size is insufficient, and the high inter-individual variability observed reduces the statistical power and representativeness of the data. I recommend increasing the sequencing sample size or, at a minimum, explicitly acknowledging the risk of false positives due to the small sample size in the Discussion.

      We thank the reviewer for this important comment. We would like to clarify that microbiome sequencing was performed on 6 samples per group and not on 4 samples per group, and we will revise the Methods section to improve clarity regarding the number of biological replicates analyzed. The 4 samples were used only for whole proteome analysis.

      In addition, several previously published murine microbiome studies investigating gut microbial alterations in liver disease and FMT interventions have used comparable sample sizes (typically 5-8 animals per group) for 16S rRNA sequencing analyses [1–3]. Nevertheless, we agree that inter-individual variability may influence microbiome analyses, and we will therefore explicitly acknowledge this limitation and the possibility of reduced statistical power in the revised Discussion section. We will also ensure that interpretations derived from microbiome compositional analyses are presented more cautiously.

      (2) The layout of Figure 4 should be adjusted. Panel A should be enlarged for better visibility, while Panel B should be reduced in size to balance the figure composition.

      We thank the reviewer for this suggestion. We will revise the layout of Figure 4 accordingly by enlarging Panel A for improved visibility and reducing the size of Panel B to achieve a more balanced figure composition.

      (3) A rationale should be provided for the selection of egg white protein as the animal protein control. Does this adequately represent animal proteins in general? Could the results differ if casein or whey protein were used? The current choice limits the generalizability of the conclusions, and this limitation should be addressed.

      We thank the reviewer for this important suggestion. In the revised manuscript, we will provide additional rationale for selecting egg albumin as the animal-derived protein source. Egg albumin was chosen because it is a well-characterized protein with high biological value, rapid digestibility, standardized composition, and has also been used in our previous ALD-related dietary intervention studies for experimental consistency [4].

      We agree that egg albumin does not represent all animal protein sources. Due to its rapid digestion and absorption, relatively less substrate may reach the distal gut for microbial fermentation compared with more complex proteins. In contrast, proteins such as casein or whey may generate distinct microbial and metabolite profiles and potentially different host responses.

      Accordingly, we will explicitly acknowledge this limitation in the revised manuscript and clarify that our findings should not be generalized to all animal-derived proteins.

      (4) The ALD model was established over 12 weeks, yet the FMT intervention consisted of only 3 administrations with a 1-week observation period. In the context of such a severe liver injury model, a 1-week recovery period appears insufficient to observe genuine fibrosis reversal, which typically requires a longer timeframe. The authors should discuss whether short-term FMT can truly induce structural remodeling or if the observed effects are transient.

      We thank the reviewer for this important and thoughtful observation. We agree that a one-week post-FMT observation period appears insufficient to conclude complete structural remodeling or durable fibrosis reversal in a chronic 12-week ALD model. However, the results achieved with the one-week intervention suggest otherwise in this animal model of ALD. Immunohistochemistry of the abstinence and treatment groups, further quantified for steatosis and fibrosis, shows a __% and __% reduction, respectively, in the treatment group. We therefore conclude that, in the given animal model, alternate-day FMT for 3 doses can reverse steatosis and fibrosis.

      In the revised manuscript, we will explicitly clarify this distinction.

      (5) The results rely heavily on PICRUSt2 for functional prediction. As prediction does not equate to factual validation, the authors should exercise caution in their wording within the Discussion. Alternatively, I recommend supplementing the study with shotgun metagenomic sequencing to verify the existence of these pathways rather than relying solely on predictive algorithms.

      We thank the reviewer for this important suggestion and agree that PICRUSt2-based analyses represent predictive functional inference rather than direct validation of microbial metabolic activity. We will explicitly acknowledge in the Results and Discussion that PICRUSt2 outputs are inferences rather than measurements, and we will integrate our metabolomics data to show where predicted microbial pathways (fatty acid salvage and β-oxidation-related pathways) coincide with measurable metabolite shifts, providing observational support for the predictions.

      We would prefer not to perform metagenomic analysis to substantiate the PICRUSt2 findings, primarily because metagenomic analysis would provide information on the set of genes each species carries, not on the functional state of the resulting pathways. To read out the pathways, we would be left with the same two options: PICRUSt2 or metabolome analysis. Transcriptome analysis could indicate which pathways are active, but that readout is likely to be similar to the one we obtain from the end result of these pathways, the metabolome.

      (6) Although Egg-FMT was less effective than Veg-FMT, it performed better than the standard FMT or abstinence groups. Why is the effect of egg white protein intermediate? Is this due to rapid digestion resulting in insufficient substrate, or differences in metabolite production? A deeper comparative analysis of the Egg-FMT group is required, rather than treating it merely as a negative control.

      We thank the reviewer for this insightful observation. We agree that the Egg-FMT group demonstrated an intermediate phenotype and should not be interpreted merely as a negative control. In the revised manuscript, we will modify the text to mention the outcomes with egg protein wherever they are missing, and we will expand the Discussion accordingly.

      (7) “Relying solely on the ‘inhibitor blocking effect’ proves only that Caproic acid's function is dependent on the PPARα pathway, not that it directly acts on PPARα. To claim direct activation, the authors must demonstrate direct binding between Caproic acid and the PPARα protein (e.g., via SPR or MST assays). Alternatively, a luciferase reporter assay driven specifically by PPARα response elements (PPRE) should be conducted. If Caproic acid induces luminescence, it would confirm transcriptional activation of PPARα rather than mere downstream activation.”

      We thank the reviewer for this important and insightful suggestion. We agree that the current inhibitor-based experiments primarily support the involvement of the PPARα pathway and do not definitively establish direct interaction or transcriptional activation of PPARα by caproic acid. Accordingly, in the revised manuscript, we will moderate our interpretation and avoid statements implying direct activation based solely on the current data.

      We also agree that direct validation experiments such as SPR/MST-based binding assays or PPRE-driven luciferase reporter assays would substantially strengthen the mechanistic conclusions. We are currently planning additional experiments to further evaluate the direct action of caproic acid on PPARα and will incorporate these analyses in future revisions and follow-up studies.

      Given the pending experiments, we kindly request that the Editors allow us about 2 months to return the revised manuscript.

      References:

      (1) Mitsinikos, F. T., Chac, D., Schillingford, N. & DePaolo, R. W. Modifying macronutrients is superior to microbiome transplantation in treating nonalcoholic fatty liver disease. Gut Microbes 12, 1792256.

      (2) Ferrere, G. et al. Fecal microbiota manipulation prevents dysbiosis and alcohol-induced liver injury in mice. J. Hepatol. 66, 806–815 (2017).

      (3) Zhang, Y., Li, P., Chen, B. & Zheng, R. Therapeutic effects of fecal microbial transplantation on alcoholic liver injury in rat models. Clin. Res. Hepatol. Gastroenterol. 48, 102478 (2024).

      (4) Mittal, A. et al. Protein supplementation differentially alters gut microbiota and associated liver injury recovery in mouse model of alcohol-related liver disease. Clin. Nutr. 46, 96–106 (2025).

    1. eLife Assessment

      This Review Article provides a compendium of advice for MD-PhD students to consider when deciding which, if any, clinical field they will select for residency training. It is grounded in published data and effectively considers factors including the potential for clinical disciplines to sustain research integration, provide mentorship, meet lifestyle expectations, and foster a long-term career as a research-focused physician-scientist.

    2. Reviewer #1 (Public review):

      [Editors' note: this version has been assessed by the Reviewing Editor without further input from the original reviewers. The review comments were minor and constructive, and the authors have been very responsive.]

      Summary:

      This brief piece by Swartz and colleagues outlines the complexities surrounding the choice of clinical specialty for physician-scientists. It is, in general, clear and well-written, and it will be useful to research-oriented medical students choosing a path and to the mentors who are guiding them.

      Strengths:

      The writing is clear. The points made are not profound, but they are important and will be of use to the intended audience.

    3. Reviewer #2 (Public review):

      Summary:

      This article is a useful compendium of advice for MD/PhD students (and research-focused MD students) to consider when it is time to decide on a clinical field for residency training. The authors are a distinguished group of physician-scientists and program directors who are drawing on published data and their own experience as mentors to provide advice and resources to students about to make what can be a career-defining choice. It makes an effective argument for considering important differences between clinical fields in their ability to sustain research integration, provide mentorship, meet lifestyle expectations, and foster a long-term career as a research-focused physician-scientist.

      Strengths:

      (1) A lot has been written about physician-scientists as an endangered species. Given the important role that physician-scientists can play if they engage in research that is informed by experience in patient care, not nearly enough has been written about the choices that students make during training that can keep them on track or throw them off.

      (2) The article provides not only general advice, but specific information in the 2 tables that can help trainees to weigh their priorities and consider their options.

      (3) Among the best advice is to weigh clinical demands, maintenance of procedural skills, recognition of the impact of research time on salary, and the impact of high salaries on the tension between research effort and clinical effort in clinical departments, which is where most physician-scientists in academia are employed.

    4. Author response:

      The following is the authors’ response to the original reviews

      eLife Assessment

      This Review Article provides a compendium of advice for MD-PhD students to consider when deciding which, if any, clinical field they will select for residency training. It is grounded in published data and effectively considers factors including the potential for clinical disciplines to sustain research integration, provide mentorship, meet lifestyle expectations, and foster a long-term career as a research-focused physician-scientist.

      We thank the editors for this positive assessment. We have revised the manuscript to sharpen the decision-making framework and make the advice more actionable, as detailed below.

      Public reviews:

      Reviewer #1 (Public review):

      This brief piece by Swartz and colleagues outlines the complexities surrounding the choice of clinical specialty for physician-scientists. It is, in general, clear and well-written, and it will be useful to research-oriented medical students choosing a path and to the mentors who are guiding them.

      We thank Reviewer #1 for these supportive comments.

      Strengths:

      The writing is clear. The points made are not profound, but they are important and will be of use to the intended audience.

      We appreciate this assessment and agree that the value of this piece lies in consolidating practical, experience-based guidance in one resource for trainees and mentors.

      Weaknesses:

      I have only minor suggestions for improvement. There are some areas of redundancy where the article could be tightened up by consolidating.

      We agree and have made substantial revisions to reduce redundancy throughout the manuscript. Specifically, we have streamlined the Introduction by removing a lengthy paragraph that previewed the article’s contents in a way that overlapped with later sections. The revised Introduction now concisely introduces five core decision-making factors (alignment between clinical and research interests, the structure of clinical work, availability of mentorship and research pathways, institutional culture, and financial sustainability) and directs readers to the new Table 1 and Figure 1 as organizing frameworks.

      We have also consolidated overlapping discussions of research alignment, protected time, and clinical demands. The sections on clinical workload and protected research time have been tightened to minimize repeated points about specialty-specific demands, and we now cross-reference Table 1 rather than re-stating the same considerations in multiple places. Prose has been revised throughout for concision and clarity.

      Reviewer #2 (Public review):

      This article is a useful compendium of advice for MD/PhD students (and research-focused MD students) to consider when it is time to decide on a clinical field for residency training. The authors are a distinguished group of physician-scientists and program directors who are drawing on published data and their own experience as mentors to provide advice and resources to students about to make what can be a career-defining choice.

      We thank Reviewer #2 for this generous and thoughtful evaluation.

      Strengths:

      (1) A lot has been written about physician-scientists as an endangered species. Given the important role that physician-scientists can play if they engage in research that is informed by experience in patient care, not nearly enough has been written about the choices that students make during training that can keep them on track or throw them off.

      We share this perspective and appreciate the reviewer’s recognition of this gap in the literature. Our goal was precisely to address the decision-making process itself, which is often under-discussed in formal publications despite being a frequent topic in mentoring conversations.

      (2) The article provides not only general advice, but specific information in the 2 tables that can help trainees to weigh their priorities and consider their options.

      Thank you. We have further strengthened the tabular content in this revision by adding a new Table 1 (described below) and renumbering the original tables accordingly.

      (3) Among the best advice is to weigh clinical demands, maintenance of procedural skills, recognition of the impact of research time on salary, and the impact of high salaries on the tension between research effort and clinical effort in clinical departments, which is where most physician-scientists in academia are employed.

      We appreciate this feedback and have made this advice more prominent by incorporating these factors explicitly into the new Table 1 framework and by adding a more direct statement in the text about how specialty-specific structural differences affect the ease of sustaining a research career.

      Area for Improvement

      (1) Some of the most useful pieces of advice are scattered through the text when they might be more impactful if focused. For example, what are the 4 or 5 most essential factors that someone in an MD/PhD or an MD program should weigh when they are deciding between clinical disciplines? There are also published data on the experience of past graduates in achieving a research-focused career in each clinical discipline. How should that data be applied by trainees? What are the factors that should be weighed in deciding where to work as a research-focused physician once training has been completed?

      We agree that the most critical decision-making factors were insufficiently distilled. To address this, we have made two major changes.

      First, we have added a new Table 1: “Key Decision Factors for Physician-Scientists Choosing a Clinical Specialty.” This table identifies five essential factors—(i) Alignment of Clinical Specialty with Research Focus, (ii) Structure of Clinical Work and Its Impact on Research Time, (iii) Availability of Structured Research Pathways and Mentorship, (iv) Institutional Environment and Culture, and (v) Financial Model and Long-Term Sustainability—and for each provides columns describing Why It Matters, What to Look For, and Potential Red Flags. This table is designed to be directly actionable for trainees comparing specialties and programs.

      Second, the Introduction now explicitly names these five factors as the organizing framework for the article and directs readers to Table 1 as a synthesis tool. The prior introductory paragraph, which previewed the article’s structure in a general way, has been replaced with a more focused synthesis.

      Regarding the published outcomes data: we have retained the specialty-specific outcomes data in what is now Table 2 (previously Table 1) and have added context in the text about how trainees should interpret these data—specifically, that published graduation and career outcome data provide a useful baseline but should be weighed alongside institutional context, since the same specialty can look very different at different institutions.

      Regarding factors for evaluating post-training positions: we have added a new paragraph in the section on Protected Research Time that addresses how trainees can evaluate the institutional environment at the faculty level, including specific metrics trainees can examine (see response to Points #4 and #5 below).

      (2) Some clinical fields at academic institutions have proved to be much more hospitable to careers as research-focused physicians than others. Published data highlight the challenges. I believe the authors have tried very hard to present a balanced perspective, but in the process, they have, I believe, missed an opportunity to guide trainees and make them aware of what they should look for to avoid making a decision that may prove incompatible with their long-term goals.

      We appreciate this candid observation and agree that our prior draft was overly cautious in this regard. In the revision, we have added a more explicit statement acknowledging that while successful physician-scientists exist across all specialties, the structural ease of sustaining a research-intensive career varies substantially by field. Specifically, we have added the following language to the section on Balancing Clinical and Research Responsibilities:

      “In practice, specialties with high procedural demands and unpredictable clinical schedules are often more challenging environments for sustaining research-intensive careers unless strong institutional protections are in place. While successful physician-scientists exist across all specialties, the structural ease of sustaining a research-intensive career varies substantially by field, and trainees should approach certain specialties with a clear understanding of the additional negotiation and institutional support required.”

      Additionally, the new Table 1 includes a “Potential Red Flags” column that gives trainees concrete warning signs to watch for when evaluating specialties and programs (e.g., departments primarily driven by clinical revenue with limited research infrastructure; absence of physician-scientists in leadership roles; inability to reduce clinical effort).

      (3) Where will be the jobs for physician-scientists who have an MD ± PhD and want to do research and discovery? How many openings will there be for physician-scientists in academia 5–10 years from now? In industry? How are recent events in Washington affecting the continuation of those jobs?

      After careful consideration, we believe that a detailed treatment of labor market projections, industry trends, and the effects of federal funding policy on the physician-scientist workforce falls outside the scope of this article, which is focused on the decision-making process for specialty selection. We note that the workforce question has been the subject of several recent analyses and commentaries (e.g., Milewicz et al., ASCI/AAP/APSA workforce reports) and feel that a thorough treatment would warrant a dedicated manuscript. We have not added this content, but we will keep the reviewer’s point in mind for future work.

      (4) Should one of the “smart choices” in the article’s title be where you do the residency, and not just which residency you do? How important is it to be at a successful, research-intensive medical center/university, both during and after residency and fellowship training? If being in an institution where there are numerous very successful physician-scientists and scientists improves the likelihood of being able to sustain a physician-scientist career, how should graduating students improve their chances of being at one of those institutions?

      This is an excellent point, and we agree that institutional environment is at least as important as specialty choice itself. We have made several changes to address this.

      In the Introduction, we have added the statement: “Importantly, the ability to sustain a physician-scientist career is often determined as much by the institutional environment and training program as by the specialty itself.” This signals early in the manuscript that “where” is as critical as “which.”

      In the new Table 1, we have included a row on “Institutional Environment and Culture” as one of the five key decision factors, with the explicit note that institutional commitment is often more determinative than specialty alone in enabling long-term success as a physician-scientist.

      We have also added a dedicated paragraph advising trainees to assess the broader institutional environment by examining: (i) the number of R01-funded investigators within the department, (ii) the presence of institutional training grants (e.g., T32 programs), and (iii) the track record of trainees transitioning from mentored (K) awards to independent (R) funding. We direct trainees to publicly available resources such as NIH RePORTER and the Blue Ridge Institute for Medical Research rankings.

      Finally, we have added a concluding sentence to the protected time section: “Taken together, these factors reinforce that institutional environment and departmental culture are often as determinative as specialty choice itself in shaping a sustainable physician-scientist career.”

      (5) In every clinical discipline, there are departments that value physician-scientists more than other departments and invest accordingly. What advice would the authors give to help graduating students identify those departments?

      This point is closely related to Point #4, and we have addressed it through the same set of revisions. The new paragraph on evaluating institutional environments provides concrete, actionable guidance for trainees on how to assess departmental commitment to physician-scientists, including specific metrics (R01 density, T32 presence, K-to-R transition rates) and publicly accessible tools (NIH RePORTER, Blue Ridge Institute rankings).

      The new Table 1 “Potential Red Flags” column highlights warning signs that a department may not be supportive of physician-scientist careers, including: departments primarily driven by clinical revenue (RVUs) with limited research infrastructure; lack of protected time enforcement; minimal NIH funding; and absence of physician-scientists in leadership roles.

      We have also expanded the existing discussion in the section on mentorship and residency selection, where we already noted the value of identifying departments with T32 grants and active physician-scientist mentors. The revised text now more explicitly connects these markers to the departmental evaluation process.

      We believe these revisions substantially strengthen the manuscript and are grateful for the reviewers’ constructive feedback.

    1. eLife Assessment

      This valuable study presents a tool that uses brain anatomy to predict the layout and size of early visual maps, and it is strengthened by the use of a large and diverse collection of scans to examine differences across people and groups. The evidence is solid for the general usefulness of the approach, but incomplete for some of the broader claims about prediction accuracy and use across data sets, particularly for estimates of map size and for showing that the model improves on repeated functional measurements. This paper is likely to be of significant interest to visual perception researchers, especially those who use fMRI.

    2. Reviewer #1 (Public review):

      Summary:

      This paper describes a deep learning toolbox that can be used to automatically estimate functional topographic maps directly from human brain anatomy. Building on the first author's earlier work, which demonstrated the feasibility of using deep learning for this purpose, the new version of the toolbox now requires only a single anatomical MRI scan to generate predictions, eliminating the need for a myelin scan. This represents a significant practical improvement.

      Strengths:

      Having such a toolbox is very useful, since manual annotation and delineation of functional visual field maps is a laborious process that also requires deep expertise. The toolbox can save researchers substantial amounts of time and money, and also allows less experienced researchers to now perform this type of analysis. Notably, for certain participants and patients, the time they are able to reside in the scanner might be limited. Being able to focus on the primary research question, rather than the essential yet basic topographic information, could boost data quality and evaluation and might limit the number of participants that need to be included.

      Weaknesses:

      In the paper, the authors compare the performance of their new version to two previous approaches. Figure 2b shows that the new toolbox performs similarly to the previous deep-learning-based toolbox, but requires only an anatomical scan, which is a significant improvement. They also compare it to an older method that uses an atlas without requiring deep learning. For eccentricity and pRF size predictions, both deep-learning methods perform better than the older approach. For polar angle, a critical parameter for delineating visual field maps, the gain is substantially less. Moreover, the comparison to the atlas method (Benson2014) is not entirely fair, as, to our knowledge, there is also a more advanced atlas version that uses Bayesian fitting methods and already performs better than the old method. To better understand the gain of using deep learning, it would be beneficial if the authors also made the comparison to this more recent atlas-based approach. Moreover, it would be useful to know the correlations for the representative participant. Some examples of relatively "bad" maps would also be useful to have (and could be provided as supplementary information).

      Figure 2b shows that the toolbox is quite good at estimating eccentricity and polar angle parameters, but less good at estimating the population receptive field (pRF) size. I will return to this latter point.

      An interesting feature is that while the toolbox is trained on a specific data set (HCP), it can, "out-of-the-box", be applied to different existing data sets, without the need to retrain the model. This is quite important for the general utility of the method. The results for this are shown in Figure 3. Again, in panel b, it can be seen that the toolbox does a good job at estimating eccentricity and polar angle values, but performs rather poorly for pRF size: the deepRetinotopy toolbox has a strong tendency to only estimate very small pRFs, particularly when applying it across different datasets. For this reason, at the moment, these estimates appear hardly useful. It would be very helpful for readers if the authors could clarify or elaborate on this point, particularly regarding the limitations of pRF size predictions. They explain that this could be due to the use of different types of stimuli, but even within the same (HCP) dataset, the predictions primarily suggest tiny pRFs, even though the training dataset also contains larger ones (which can be better seen in supplementary Figure 4). Showing the predictions for higher-order brain areas, which have larger pRFs on average, could serve a similar evaluation purpose. Presumably, the underlying reasons are complex and could relate to the use of different stimuli, different analysis toolboxes, and how the deep learning model is currently being trained. Possibly, the abundance of small pRFs at lower eccentricity in the training set (which is usually the case in any empirical analysis) has given the model a very strong bias toward predicting small pRFs.

      There would be various ways to verify which of these components is critical. For example, the model could be trained only on the bar stimuli of the HCP dataset, or the pRFs for all stimuli and datasets could be estimated using the same software tool. The latter seems important. For example, Supplementary Figure 4 indicates a high correlation between the Stanford and NYU cohorts that have used the same stimulus and analysis package, despite having different resolutions and scanners. Further investigation into the underlying reasons for these discrepancies would strengthen the paper. It would also provide valuable guidance for users of the toolbox on which toolbox predictions to trust and which not, as well as how well the model generalizes to other stimulus types, scanners, and image resolutions.

      An aspect that is not directly apparent from the title, abstract, and introduction is that the deepRetinotopy toolbox does not by itself produce estimates of visual area labels or boundaries. It predicts only polar angle and eccentricity values. To predict labels and boundaries, the authors combine the toolbox with an atlas (the aforementioned Bayesian atlas). For visual areas V1 - V3, it does a very good job, in that the predictions are as good as the empirical ones. Notably, the authors indicate that the predictions for V2 and, in particular, V3 are worse than for V1, but Figure 4 clearly shows that predictions are as good as the empirical ones. More cannot be expected from a model that is trained on such empirical data.

      Irrespective of the limitations with respect to predicting pRF size, the toolbox opens up functionally oriented analyses of very large cohorts of healthy participants, of which only anatomical data is available. The authors present an example of this by confirming the existence of differences in horizontal and vertical asymmetries in the field maps of the visual cortex of children and adults. While Figure 5 confirms the existence of differences, the analysis could be expanded to provide deeper insights, such as normalized developmental trajectories for both asymmetries, given the size of the dataset. This would better highlight the true power of their approach.

      While the authors address limitations with respect to studying experience-dependent atypical functional organization, they do not address how the deepRetinotopy toolbox would handle (acquired) brain lesions. Addressing this, even if only speculative, would be welcome. Another welcome addition would be to see the predictions for additional brain areas, even if those would (presumably) be worse at present. Such information would nevertheless be essential for users considering applying this toolbox. Moreover, this could be a valuable resource serving as a benchmark for future iterations of either deepRetinotopy or other approaches.

    3. Reviewer #2 (Public review):

      Summary:

      The authors introduce the deepRetinotopy toolbox, a deep learning-based software package that allows for user-friendly automatic delineation of visual areas based on anatomical (T1-weighted) MRI scans. This is an important evolution over a prior published version of the software, which required myelin maps additionally. The new version will hence allow many more users to obtain high-fidelity field-map delineations based on existing data or using standard protocols, providing a huge advance to the field. The authors exploited this strength and mapped visual field maps (for areas V1-V3) in 11060 human MRI scans covering different age classes to quantify changes of retinotopic organization across age groups, showing that previously functionally identified imbalances of early visual cortex field maps can now be identified on the basis of anatomical scans alone.

      Strengths:

      Overall, this is a tremendously important methodological contribution, primarily of high practical and applied value. It allows functional imaging labs to delineate human cortical visual field maps with confirmed high fidelity using anatomical T1-weighted scans only. This will save the expensive functional imaging and time-consuming analyses that were previously required to achieve nearly the same result, and it yields far better results than prior model-based approaches offered.

      Also, the quantification of the accumulated very large dataset is meticulous and provides impressively detailed results of the field map changes for areas V1-V3 as a function of age.

      Weaknesses:

      (1) The weak point of the contribution is the choice to limit anatomical quality assessments and error quantifications to just three early regions, V1-V3, even though the deepRetinotopy toolbox can delineate over 20 regions (including parietal, ventral, and lateral regions, such as IPS0-5, hV4, VO1-2, V3A, PHC1-2, LO1-2, and TO1-2).

      (2) The limit is fine for their large-scale application of the toolbox to age groups, as here, a clear hypothesis on early cortex variability was tested.

      (3) However, the introduction of the toolbox itself warrants quality assessments and comparisons to prior models and ground truth beyond V1-V3, just like the authors did in their prior publication of the predecessor model.

      (4) This is important as the vast majority of applications of this toolbox will likely go beyond V1-V3 to delineate dorsal, ventral, and lateral regions.

      (5) For the present paper, this will require only 1 or 2 additional figures, or extending their present figures 2 and 4 along the lines of their previous figure 7 (Ribeiro et al 2021), which included error measures for high-level regions. Ideally, you provide sub-graphs separately for early visual, dorsal, ventral, and lateral regions.

      (6) Going beyond V1-V3 is important for several reasons: first, future studies applying the software beyond V3 will need quantification for reassurance and justification. Second, for the sake of transparency, even if results are noisy or on par with prior models. Third, as a benchmark or reference point for future approaches.

    4. Reviewer #3 (Public review):

      Summary:

      This valuable study presents a tool that uses brain anatomy to predict the layout and size of early visual maps, and it is strengthened by testing across a large and diverse collection of scans. The work will be useful for researchers who want to estimate likely visual map layout from standard anatomical scans and to relate anatomical differences to differences in visual organization across groups. The evidence is solid for the general usefulness of the approach, but incomplete for broader claims about prediction accuracy and use across datasets, particularly for estimates of map size and for showing that the model improves on repeated functional measurements.

      Strengths:

      The paper addresses a useful and important problem: estimating early visual map organization from anatomical measurements alone. Tools that predict these types of functional data from anatomical measurements were introduced more than a decade ago by Benson and colleagues, and the present authors have significantly extended that work. That is a real strength of the manuscript, because there is genuine value in having a practical tool that can estimate likely visual organization from standard anatomical scans.

      Another major strength is the rigorous cross-dataset benchmarking and the accumulation of multiple datasets. The authors assembled a large and diverse set of scans and assessed model performance across different scanners, field strengths, and visual stimuli, which gives the reader a much better sense of how broadly the approach may apply. The retrospective analysis of more than 11,000 scans is especially notable and creates an unusual opportunity to ask how anatomical variation may relate to population differences in visual organization.

      I also think the paper does a good job of showing why such a tool could matter in practice. A complete tool could be used in several ways. First, it could help users identify the locations of activations measured in other experiments with respect to the typical V1-V3 maps. Second, maps measured from an individual subject or patient could be compared with the predictions from the tool to ask whether they differ meaningfully from a standard anatomy-based map. Third, the tool can be used, as the authors have done here, to examine differences in anatomy across populations and interpret these differences with respect to retinotopic maps. Of these uses, the first already seems well supported by the current presentation.

      Weaknesses:

      (1) Quantification of the Analysis

      My main concern is that the analysis relies heavily on global summary measures such as correlation and Dice score. Those measures are useful, but the paper would be more informative if it also quantified boundary differences in millimeters, especially for comparisons such as the V1/V2 boundary in Figure 2. That kind of analysis would help readers understand how large the errors are in physically meaningful terms.
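
      To make the contrast between an overlap measure and a physically scaled boundary error concrete, here is a minimal sketch. It assumes per-vertex integer area labels and boundary vertex coordinates in millimetres; the function names and toy data are hypothetical and are not taken from the deepRetinotopy toolbox. Straight-line (Euclidean) distance is used as a simple stand-in; a geodesic distance along the cortical surface would likely be more appropriate in practice.

```python
import numpy as np

def dice_score(labels_a, labels_b, area):
    """Dice overlap for one area label between two per-vertex label arrays."""
    a = np.asarray(labels_a) == area
    b = np.asarray(labels_b) == area
    denom = a.sum() + b.sum()
    if denom == 0:
        return float("nan")  # area absent from both labelings
    return 2.0 * np.logical_and(a, b).sum() / denom

def mean_boundary_distance_mm(coords_a, coords_b):
    """Symmetric mean nearest-neighbour distance between two boundary point sets.

    Assumes coordinates are given in millimetres; uses Euclidean distance,
    not geodesic distance along the surface.
    """
    coords_a = np.asarray(coords_a, dtype=float)
    coords_b = np.asarray(coords_b, dtype=float)
    # Pairwise distance matrix of shape (n_a, n_b)
    d = np.linalg.norm(coords_a[:, None, :] - coords_b[None, :, :], axis=-1)
    return 0.5 * (d.min(axis=1).mean() + d.min(axis=0).mean())

# Toy data (hypothetical): 0 = unlabeled, 1 = V1, 2 = V2
empirical_labels = [1, 1, 1, 2, 2, 0]
predicted_labels = [1, 1, 2, 2, 2, 0]
v1_dice = dice_score(empirical_labels, predicted_labels, area=1)

# Toy V1/V2 boundaries: the predicted boundary is shifted by 1.5 mm along x
empirical_boundary = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 2.0, 0.0]]
predicted_boundary = [[1.5, 0.0, 0.0], [1.5, 1.0, 0.0], [1.5, 2.0, 0.0]]
boundary_error = mean_boundary_distance_mm(empirical_boundary, predicted_boundary)

print(f"V1 Dice: {v1_dice:.2f}, boundary error: {boundary_error:.2f} mm")
```

      A Dice score summarizes overlap of the whole area, whereas the boundary distance reports the displacement of a specific border in millimetres, which is the kind of number requested here.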

      (2) Model fitting methods

      I also think the discussion of prediction failures for pRF size should be more explicit. The mismatch is likely influenced by the fact that the training data and several evaluation datasets were fit with different models and different analysis software. In particular, the network was trained on non-linear size estimates from the HCP data, while the comparison datasets were derived using other packages and, in some cases, different model assumptions. That likely contributes to the spread in Figure 3b and should be discussed more directly. It is important to discuss that the pRF parameters were derived using different software tools.

      - HCP dataset (training data): analyzePRF (Compressive Spatial Summation model)

      - NYU dataset: vistasoft

      - Stanford dataset: vistasoft

      - New Zealand dataset: SamSrf

      - CHN dataset: Custom MATLAB software

      (3) Clarifying Model Accuracy

      If deepRetinotopy generates a true "noise-removed" representation of functional mapping based on anatomy, then fitting it to one fMRI measurement should predict a second, independent fMRI run better than the noisy data from the first run does.

      The authors possess the exact data for this test. For the HCP dataset, the empirical fMRI data were explicitly separated into two halves: "fit 2" (the first half of the fMRI runs) and "fit 3" (the second half). They correlated these two halves to establish a "noise ceiling," the maximum possible reliability of the data. Looking at their results in Figure 2b, the correlation of the deepRetinotopy predictions falls below this noise ceiling. This means that the noisy functional Half 1 actually predicts functional Half 2 better than the anatomical model does.

      The authors should state this explicitly. A side-by-side plot of Half 1 predicting Half 2 versus deepRetinotopy predicting Half 2 would show that the anatomical model regularizes map location well, but misses reliable subject-specific variation that anatomy alone cannot capture.
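
      The suggested side-by-side comparison can be sketched on simulated data. Everything below is a hypothetical illustration, not the HCP data or the authors' analysis: the simulated map is the sum of a shared component (which an anatomy-based model is assumed to recover) and a reliable subject-specific deviation (which it is assumed to miss), and each half adds independent measurement noise. Pearson correlation suits a linear quantity such as eccentricity; polar angle would need a circular measure instead.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation between two per-vertex maps."""
    return float(np.corrcoef(np.asarray(x, float), np.asarray(y, float))[0, 1])

def compare_to_split_half(half1, half2, model_pred):
    """Correlate Half 1 and the model prediction against the same held-out Half 2."""
    return {
        "half1_vs_half2": pearson_r(half1, half2),
        "model_vs_half2": pearson_r(model_pred, half2),
    }

# Simulated maps; all magnitudes are arbitrary assumptions chosen for illustration
rng = np.random.default_rng(0)
n_vertices = 3200
group_map = rng.uniform(0.0, 8.0, n_vertices)    # component predictable from anatomy
subject_dev = rng.normal(0.0, 1.0, n_vertices)   # reliable subject-specific variation
half1 = group_map + subject_dev + rng.normal(0.0, 0.5, n_vertices)
half2 = group_map + subject_dev + rng.normal(0.0, 0.5, n_vertices)
model_pred = group_map                           # model captures only the shared part

result = compare_to_split_half(half1, half2, model_pred)
print(result)
```

      Under these simulation assumptions, Half 1 predicts Half 2 better than the model does, because the model has no access to the subject-specific deviation. Whether and by how much this holds for real data is exactly what the requested plot would show.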

      (4) The Hemodynamic Response Function

      The assumptions used to generate the original empirical maps are permanently baked into the deep learning model. However, the authors explicitly mention the hemodynamic response function (HRF) only once, noting in the Methods that the modeled time series was "convolved with a canonical hemodynamic response function."

      Beyond this single mention, there is no direct discussion of how the assumption of a single canonical HRF across all 161 HCP training subjects might have systematically impacted or biased the network's predictions. The authors address cross-dataset differences broadly under the umbrella of "experimental design" and "fMRI preprocessing pipeline" biases, but the HRF is a core biological property that mediates the connection between the anatomy and the data. The authors should explicitly discuss how this canonical assumption limits or biases the resulting deepRetinotopy network.

      (5) Scoping the Input Data and Normative Use

      The authors use FreeSurfer to generate a mean curvature map for the entire midthickness cortical surface. This full-hemisphere curvature map is resampled to a standard template surface space (32k_fs_LR), acting as the data frame that feeds input features into the neural network. However, while the network receives the full geometric structure of the hemisphere, it is explicitly trained to predict retinotopic parameters only within a restricted posterior ROI, based on the Wang et al. atlas and containing roughly 3,200 vertices per hemisphere.

      A useful experiment to try, and perhaps the authors have already considered this, would be to restrict the input features exclusively to the posterior vertices. Including all anterior vertices may make it harder for the network to fit the localized visual data. A brief commentary on why the full hemisphere was retained as input could be highly informative for researchers adapting this geometric deep learning pipeline.
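
      The distinction between what the network receives and what it is trained on can be illustrated with a masked loss. This is a sketch under assumptions: the function name, the use of mean squared error, and the toy values are hypothetical, and the actual loss used to train deepRetinotopy may differ.

```python
import numpy as np

def masked_mse(pred, target, roi_mask):
    """Mean squared error restricted to vertices inside the ROI mask."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    roi_mask = np.asarray(roi_mask, dtype=bool)
    if not roi_mask.any():
        raise ValueError("ROI mask selects no vertices")
    diff = pred[roi_mask] - target[roi_mask]
    return float(np.mean(diff ** 2))

# Toy hemisphere with four vertices; only the first three lie inside the ROI
pred = [1.0, 2.0, 3.0, 10.0]
target = [1.0, 2.0, 5.0, 0.0]
roi_mask = [True, True, True, False]

loss = masked_mse(pred, target, roi_mask)
print(loss)
```

      In this setup, the large error at the fourth vertex does not contribute to the loss, so vertices outside the ROI can influence predictions only as input features. Restricting the input itself to posterior vertices, as proposed above, would be a separate change to the model's inputs rather than to its loss.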

    5. Author response:

      Public Reviews:

      Reviewer #1 (Public review):

      In the paper, the authors compare the performance of their new version to two previous approaches. Figure 2b shows that the new toolbox performs similarly to the previous deep-learning-based toolbox, but requires only an anatomical scan, which is a significant improvement. They also compare it to an older method that uses an atlas without requiring deep learning. For eccentricity and pRF size predictions, both deep-learning methods perform better than the older approach. For polar angle, a critical parameter for delineating visual field maps, the gain is substantially less. Moreover, the comparison to the atlas method (Benson2014) is not entirely fair, as, to our knowledge, there is also a more advanced atlas version that uses Bayesian fitting methods and already performs better than the old method. To better understand the gain of using deep learning, it would be beneficial if the authors also made the comparison to this more recent atlas-based approach. Moreover, it would be useful to know the correlations for the representative participant. Some examples of relatively "bad" maps would also be useful to have (and could be provided as supplementary information).

      We thank the reviewer for their constructive feedback. We plan to expand our benchmarking section to include the Bayesian model comparison. Note, however, that the additional accuracy gain afforded with the Bayesian model of retinotopy (Benson and Winawer, 2018) results from combining anatomical data with retinotopic maps estimated with a few minutes of functional data. The Bayesian model of retinotopy without such functional data is equivalent to Benson14. We plan to report the correlations (between predicted and empirical maps) for the representative participant shown in Figure 2 and include an additional supplementary figure showing retinotopic map predictions for a participant whose predictions deviate the most from empirical maps, as suggested by the reviewer.

      Figure 2b shows that the toolbox is quite good at estimating eccentricity and polar angle parameters, but less good at estimating the population receptive field (pRF) size. I will return to this latter point.

      An interesting feature is that while the toolbox is trained on a specific data set (HCP), it can, "out-of-the-box", be applied to different existing data sets, without the need to retrain the model. This is quite important for the general utility of the method. The results for this are shown in Figure 3. Again, in panel b, it can be seen that the toolbox does a good job at estimating eccentricity and polar angle values, but performs rather poorly for pRF size: the deepRetinotopy toolbox has a strong tendency to only estimate very small pRFs, particularly when applying it across different datasets. For this reason, at the moment, these estimates appear hardly useful. It would be very helpful for readers if the authors could clarify or elaborate on this point, particularly regarding the limitations of pRF size predictions. They explain that this could be due to the use of different types of stimuli, but even within the same (HCP) dataset, the predictions primarily suggest tiny pRFs, even though the training dataset also contains larger ones (which can be better seen in supplementary Figure 4). Showing the predictions for higher-order brain areas, which have larger pRFs on average, could serve a similar evaluation purpose. Presumably, the underlying reasons are complex and could relate to the use of different stimuli, different analysis toolboxes, and how the deep learning model is currently being trained. Possibly, the abundance of small pRFs at lower eccentricity in the training set (which is usually the case in any empirical analysis) has given the model a very strong bias toward predicting small pRFs.

      There would be various ways to verify which of these components is critical. For example, the model could be trained only on the bar stimuli of the HCP dataset, or the pRFs for all stimuli and datasets could be estimated using the same software tool. The latter seems important. For example, Supplementary Figure 4 indicates a high correlation between the Stanford and NYU cohorts that have used the same stimulus and analysis package, despite having different resolutions and scanners. Further investigation into the underlying reasons for these discrepancies would strengthen the paper. It would also provide valuable guidance for users of the toolbox on which toolbox predictions to trust and which not, as well as how well the model generalizes to other stimulus types, scanners, and image resolutions.

      We will expand our discussion of the limitations of pRF size prediction, highlighting that differences in visual stimuli, analysis toolboxes used to estimate pRF parameters from empirical data, and the current training of deepRetinotopy affect prediction accuracy. As the reviewer pointed out, the underlying reasons are complex, and it is difficult to isolate all the potential contributing factors. However, in addition to our expanded discussion, we also intend to present results from additional experiments that assess the impact of different loss functions on the range of predicted pRF sizes (to explain how training may partly account for the differences observed in the HCP dataset). We will also perform pRF fitting on at least one dataset using the same software/encoding model as in the HCP dataset (the training data) to illustrate that the lower performance in pRF size prediction in out-of-distribution datasets is also partly explained by differences in how the empirical maps were obtained.

      An aspect that is not directly apparent from the title, abstract, and introduction is that the deepRetinotopy toolbox does not by itself produce estimates of visual area labels or boundaries. It predicts only polar angle and eccentricity values. To predict labels and boundaries, the authors combine the toolbox with an atlas (the aforementioned Bayesian atlas). For visual areas V1 - V3, it does a very good job, in that the predictions are as good as the empirical ones. Notably, the authors indicate that the predictions for V2 and, in particular, V3 are worse than for V1, but Figure 4 clearly shows that predictions are as good as the empirical ones. More cannot be expected from a model that is trained on such empirical data.

      We will edit the introduction and abstract to make it clearer that the deepRetinotopy toolbox does not yet produce estimates of visual boundaries on its own.

      Irrespective of the limitations with respect to predicting pRF size, the toolbox opens up functionally oriented analyses of very large cohorts of healthy participants, of which only anatomical data is available. The authors present an example of this by confirming the existence of differences in horizontal and vertical asymmetries in the field maps of the visual cortex of children and adults. While Figure 5 confirms the existence of differences, the analysis could be expanded to provide deeper insights, such as normalized developmental trajectories for both asymmetries, given the size of the dataset. This would better highlight the true power of their approach.

      Although providing insights into developmental trajectories for horizontal and vertical asymmetries is beyond the scope of the current work, as it would require aggregating datasets such that individuals’ ages span a larger range (the ABCD dataset only contains individuals between 9-11 years old and the HCP Young Adult dataset only individuals between 22-36 years old), we plan to provide some complementary analyses (differences across ages and sex within the ABCD dataset).

      While the authors address limitations with respect to studying experience-dependent atypical functional organization, they do not address how the deepRetinotopy toolbox would handle (acquired) brain lesions. Addressing this, even if only speculative, would be welcome. Another welcome addition would be to see the predictions for additional brain areas, even if those would (presumably) be worse at present. Such information would nevertheless be essential for users considering applying this toolbox. Moreover, this could be a valuable resource serving as a benchmark for future iterations of either deepRetinotopy or other approaches.

      We plan to expand and report performance evaluation across other visual areas (using Wang atlas’ parcels) to serve as a benchmarking resource. Moreover, we will expand our discussion on how deepRetinotopy would handle brain lesions.

      Reviewer #2 (Public review):

      (1) The weak point of the contribution is the choice to limit anatomical quality assessments and error quantifications to just three early regions, V1-V3, even though the deepRetinotopy toolbox can delineate over 20 regions (including parietal, ventral, and lateral regions, such as IPS0-5, hV4, VO1-2, V3A, PHC1-2, LO1-2, and TO1-2).

      (2) The limit is fine for their large-scale application of the toolbox to age groups, as here, a clear hypothesis on early cortex variability was tested.

      (3) However, the introduction of the toolbox itself warrants quality assessments and comparisons to prior models and ground truth beyond V1-V3, just like the authors did in their prior publication of the predecessor model.

      (4) This is important as the vast majority of applications of this toolbox will likely go beyond V1-V3 to delineate dorsal, ventral, and lateral regions.

      (5) For the present paper, this will require only 1 or 2 additional figures, or extending their present figures 2 and 4 along the lines of their previous figure 7 (Ribeiro et al 2021), which included error measures for high-level regions. Ideally, you provide sub-graphs separately for early visual, dorsal, ventral, and lateral regions.

      (6) Going beyond V1-V3 is important for several reasons: first, future studies applying the software beyond V3 will need quantification for reassurance and justification. Second, for the sake of transparency, even if results are noisy or on par with prior models. Third, as a benchmark or reference point for future approaches.

      We thank the reviewer for their constructive feedback, and we agree that expanding our performance assessment beyond V1-3 would be a valuable benchmarking resource. Thus, we plan to evaluate retinotopic map prediction accuracy across visual areas defined by the Wang atlas’ parcels, expanding on the results reported in Figure 2, and provide it as a supplementary figure. However, performance estimation ultimately depends on the quality of the dataset used for evaluation. The empirical maps, although treated as ground truth, may themselves misrepresent the underlying retinotopic organization. As a matter of fact, the quality of the empirical data (HCP dataset and others) is indeed lowest in some of the higher-order visual areas.

      It may be unclear from the text that the deepRetinotopy toolbox does not yet produce estimates of visual boundaries on its own. Accordingly, we illustrate how the deepRetinotopy toolbox’s predictions can be combined with another tool [the Bayesian model of retinotopy from Benson and Winawer (2018)] to obtain visual area boundaries automatically. We will edit the introduction and abstract to make this clearer. Given the availability of empirical labels (currently only for V1-3) and the segmentation tool (which was only assessed for V1-3), we cannot expand Figure 4 to other visual areas as suggested.

      Reviewer #3 (Public review):

      Quantification of the Analysis: My main concern is that the analysis relies heavily on global summary measures such as correlation and Dice score. Those measures are useful, but the paper would be more informative if it also quantified boundary differences in millimeters, especially for comparisons such as the V1/V2 boundary in Figure 2. That kind of analysis would help readers understand how large the errors are in physically meaningful terms.

      We thank the reviewer for their constructive feedback. Following the reviewer’s suggestion, we plan to expand our segmentation evaluation to quantify the extent to which boundary predictions from deepRetinotopy’s maps deviate from those from empirical maps, in millimetres.

      Model fitting methods: I also think the discussion of prediction failures for pRF size should be more explicit. The mismatch is likely influenced by the fact that the training data and several evaluation datasets were fit with different models and different analysis software. In particular, the network was trained on non-linear size estimates from the HCP data, while the comparison datasets were derived using other packages and, in some cases, different model assumptions. That likely contributes to the spread in Figure 3b and should be discussed more directly. It is important to discuss that the pRF parameters were derived using different software tools.

      We will expand our discussion of the limitations of pRF size prediction, highlighting that differences in visual stimuli, different encoding models for estimating pRF parameters from empirical data, and the current training of deepRetinotopy affect prediction accuracy. In addition to our expanded discussion, we intend to also present results from additional experiments that assess the impact of those factors on pRF size prediction performance.

      Clarifying Model Accuracy: If deepRetinotopy generates a true "noise-removed" representation of functional mapping based on anatomy, then fitting it to one fMRI measurement should predict a second, independent fMRI run better than the noisy data from the first run does.

      The authors possess the exact data for this test. For the HCP dataset, the empirical fMRI data were explicitly separated into two halves: "fit 2" (the first half of the fMRI runs) and "fit 3" (the second half). They correlated these two halves to establish a "noise ceiling," the maximum possible reliability of the data. Looking at their results in Figure 2b, the correlation of the deepRetinotopy predictions falls below this noise ceiling. This means that the noisy functional Half 1 actually predicts functional Half 2 better than the anatomical model does.

      The authors should state this explicitly. A side-by-side plot of Half 1 predicting Half 2 versus deepRetinotopy predicting Half 2 would show that the anatomical model regularizes map location well, but misses reliable subject-specific variation that anatomy alone cannot capture.

      We will expand our benchmarking section to make these comparisons (“Half 1 predicting Half 2 versus deepRetinotopy predicting Half 2”) more explicit. It is important to highlight that there is more subject-specific variation that is currently not captured by our model, and this comparison can also serve as a benchmarking resource for future model versions and newer approaches.

      The Hemodynamic Response Function: The assumptions used to generate the original empirical maps are permanently baked into the deep learning model. However, the authors explicitly mention the hemodynamic response function (HRF) only once, noting in the Methods that the modeled time series was "convolved with a canonical hemodynamic response function."

      Beyond this single mention, there is no direct discussion of how the assumption of a single canonical HRF across all 161 HCP training subjects might have systematically impacted or biased the network's predictions. The authors address cross-dataset differences broadly under the umbrella of "experimental design" and "fMRI preprocessing pipeline" biases, but the HRF is a core biological property that mediates the connection between the anatomy and the data. The authors should explicitly discuss how this canonical assumption limits or biases the resulting deepRetinotopy network.

      As Reviewers 3 and 1 have noted, the observed limitations in pRF size prediction stem from multiple underlying factors. One of those factors is indeed the HRF assumed in the encoding models. We will expand our discussion about factors that may introduce biases into deepRetinotopy predictions, including the HRF.

      Scoping the Input Data and Normative Use: The authors use FreeSurfer to generate a mean curvature map for the entire midthickness cortical surface. This full-hemisphere curvature map is resampled to a standard template surface space (32k_fs_LR), acting as the data frame that feeds input features into the neural network. However, while the network receives the full geometric structure of the hemisphere, it is explicitly trained to predict retinotopic parameters only within a restricted posterior ROI, based on the Wang et al. atlas and containing roughly 3,200 vertices per hemisphere.

      A useful experiment to try, and perhaps the authors have already considered this, would be to restrict the input features exclusively to the posterior vertices. Including all anterior vertices may make it harder for the network to fit the localized visual data. A brief commentary on why the full hemisphere was retained as input could be highly informative for researchers adapting this geometric deep learning pipeline.

      Thanks for this suggestion. We have not performed a systematic evaluation of using ROIs that span a larger portion of the cortex (including the full hemisphere). It is a great idea to do so and report it in our manuscript to inform other researchers interested in adapting our pipeline. We intend to also update our toolbox by retraining our models to take all posterior vertices as suggested, which would improve the coverage of current predictions.

    1. eLife Assessment

      This is an important and rigorous study that addresses the question of what determines the spatial organization of endocytic zones at synapses. The authors use compelling approaches, in both Drosophila and rodent model systems, to define the role of activity and active zone structure on the organization of the peri-active zone. While the findings are primarily negative, they are carefully executed and contribute to the field by refining existing models of presynaptic organization.

    2. Reviewer #1 (Public review):

      Summary:

      In this manuscript, Emperador-Melero et al. seek to determine whether recruitment of endocytic machinery to the periactive zone is activity-dependent or tethered to delivery of active zone machinery. They use genetic knockouts and pharmacological block in two model synapses - cultured mouse hippocampal neurons and Drosophila neuromuscular junctions - together with super-resolution imaging to determine how well endocytic machinery localizes after chronic inhibition or acute depolarization. They find that acute depolarization in both models has minimal to no effect on the localization of endocytic machinery at the periactive zone, suggesting that these proteins are constitutively maintained rather than upregulated in response to evoked activity. Interestingly, chronic inhibition slightly increases endocytic machinery levels, implying a potential homeostatic upregulation in preparation for rebound depolarization. Using genetic knockouts, the authors show that localization of endocytic machinery to periactive zones occurs independently of proper active zone assembly, even in the absence of upstream organizers like Liprin-α.

      Overall, they propose that the constitutive deployment of endocytic machinery reflects its critical role in facilitating rapid and reliable membrane internalization during synaptic functions beyond classical endocytosis, such as regulation of the exocytic fusion pore and dense-core vesicle fusion. Although many experiments reveal limited changes in the localization or abundance of endocytic machinery, the findings are thorough, and the data substantially support a model in which endocytic components are organized through a pathway distinct from that of the active zone. This work advances our understanding of synaptic dynamics by supporting a model in which endocytic machinery is constitutively recruited and regulated by distinct upstream organizers compared to active zone proteins. It also highlights the utility of super-resolution imaging across diverse synapse types to uncover functionally conserved elements of synaptic biology.

      Strengths:

      The study's technical strengths, particularly the use of super-resolution microscopy and rigorous image analyses developed by the group, bolster their findings.

      Weaknesses:

      One limitation, acknowledged by the authors, is the persistence of spontaneous activity at these synapses, which could still impact the organization of these regions.

      Comments on revisions:

      The authors have addressed all of my previous comments.

    3. Reviewer #2 (Public review):

      Summary:

      This study examines whether the localization of endocytic proteins to presynaptic periactive zones depends on synaptic activity or active zone scaffolds. Using genetic and pharmacological perturbations in both Drosophila and mouse neurons, the authors show that key endocytic proteins remain localized to periactive zones even when evoked release or active zone architecture is disrupted. While the findings are largely negative, the study is methodologically solid and provides useful constraints for current models of synaptic vesicle recycling.

      Strengths:

      The experimental design is careful and systematic, spanning both fly and mammalian systems. The use of advanced genetic models, including Liprin-α quadruple knockout mice, is a notable strength. High-resolution imaging approaches (STED, Airyscan) are appropriately applied to assess nanoscale organization. The study clarifies that strict activity dependence of endocytic recruitment may not be a general principle.

      Weaknesses (largely addressed in revision):

      Several initial concerns have been satisfactorily addressed in the revised manuscript. In particular, the inclusion of EndoA/Dap160 experiments and the expanded discussion improve the work. Some limitations remain, including the reliance on Tetanus toxin at the Drosophila NMJ, which does not fully abolish presynaptic fusion, and the still limited insight into the mechanistic basis of periactive zone organization. The biological interpretation of small changes in protein levels upon silencing also remains somewhat unclear.

      Comments on revisions:

      I thank the authors for the careful revision of the manuscript. The additional experiments, in particular the inclusion of EndoA and Dap160 at the Drosophila NMJ, as well as the extended discussion of limitations, are appreciated and address important points raised in the first round.

      While the principal conclusions of the study remain unchanged, and the manuscript is still largely based on negative results, I find that the authors now present these data in a more balanced and transparent manner. The discussion of activity-dependence is improved and more nuanced, especially with regard to possible contributions of spontaneous release and homeostatic effects.

      In my opinion, despite the mostly negative nature of the findings, the work provides a valuable and relevant contribution, as it defines important constraints on current models of periactive zone organization. The study is technically strong, carefully executed, and systematically performed across different model systems.

      Overall, the revised manuscript is clearly improved and represents a solid and well-executed piece of work that will be of interest to the field.

    4. Reviewer #3 (Public review):

      Summary:

      This study examines how synaptic endocytic zones are positioned using a combination of cultured neurons and the Drosophila neuromuscular junction. The authors test whether neuronal activity, active zone assembly, or liprin-α function is required to localize endocytic zone markers, including Dynamin, Amphiphysin, Nervous Wreck, PIPK1γ, and AP-180. None of the manipulations tested caused a coordinated disruption in the localization or abundance of these markers, leading to the conclusion that endocytic zones form independently of synaptic activity and active zone scaffolds.

      Strengths:

      The work is systematic and carefully executed, using multiple manipulations and two complementary model systems. The authors consistently examine multiple molecular markers, strengthening the interpretation that endocytic zone positioning is robust to changes in activity and structural assembly.

      Weaknesses:

      The main limitation is that the study does not test whether the methods used are sensitive enough to detect subtle functional disruption, and no condition tested produces clear disorganization of the endocytic zone. As a result, the conclusion that these zones assemble independently is supported by negative data, without a strong positive control for disassembly or mislocalization.

      This paper addresses a longstanding question in synaptic biology and provides a well-supported boundary on the types of mechanisms that are likely to govern endocytic zone localization. The conclusions are well justified by the data, though additional evidence would be needed to define the assembly mechanism itself.

      Comments on revisions:

      The authors responded to the initial review with care. They both revised the manuscript and conducted new experiments to address each reviewer's concern. The responses to the review were effective, and I think that the revised manuscript provides significant new insights. In my view, it does not require additional revisions.

    5. Author response:

      The following is the authors’ response to the original reviews.

      We thank the reviewers for their careful consideration of our work and constructive comments. We are glad that reviewers appreciated the rigor and value of our work. In response to the reviewer comments we have made the following changes:

      (1) Addition of new experiments on EndoA localization at the Drosophila NMJ (Fig. 2).

      (2) Addition of new experiments on Dap160 localization at the Drosophila NMJ (Fig. 2).

      (3) Addition of new experiments to validate Dynamin, Dap160 and EndoA antibodies (Fig. 2 – figure supplement 1).

      (4) Assessment of the activity-dependence of EndoA and Dap160 localization at the Drosophila NMJ (Fig. 3).

      (5) Assessment of the liprin-dependence of EndoA and Dap160 localization at the Drosophila NMJ (Fig. 8).

      (6) Addition of a limitations section to the discussion to directly address that spontaneous release was not fully ablated in our studies and might contribute to recruitment.

      (7) Addition of an outlook to the same section on what experimental avenues could address the limitations in the future.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      In this manuscript, Emperador-Melero et al. seek to determine whether recruitment of endocytic machinery to the periactive zone is activity-dependent or tethered to delivery of active zone machinery. They use genetic knockouts and pharmacological block in two model synapses - cultured mouse hippocampal neurons and Drosophila neuromuscular junctions - to determine how well endocytic machinery localizes after chronic inhibition or acute depolarization by super-resolution imaging. They find that acute depolarization in both models has minimal to no effect on the localization of endocytic machinery at the periactive zone, suggesting that these proteins are constitutively maintained rather than upregulated in response to transient activity. Interestingly, chronic inhibition slightly increases endocytic machinery levels, implying a potential homeostatic upregulation in preparation for rebound depolarization. Using genetic knockouts, the authors show that localization of endocytic machinery to periactive zones occurs independently of proper active zone assembly, even in the absence of upstream organizers like Liprin-α. Overall, they propose that the constitutive deployment of endocytic machinery reflects its critical role in facilitating rapid and reliable membrane internalization during synaptic functions beyond classical endocytosis, such as regulation of the exocytic fusion pore and dense-core vesicle fusion. Although many experiments reveal limited changes in the localization or abundance of endocytic machinery, the findings are thorough, and data substantially support a model in which endocytic components are organized through a pathway distinct from that of the active zone. This work advances our understanding of synaptic dynamics by supporting a model in which endocytic machinery is constitutively recruited and regulated by distinct upstream organizers compared to active zone proteins. It also highlights the utility of super-resolution imaging across diverse synapse types to uncover functionally conserved elements of synaptic biology.

      We thank the reviewer for the positive assessment of our study.

      Strengths:

      The study's technical strengths, particularly the use of super-resolution microscopy and rigorous image analyses developed by the group, bolster their findings.

      We thank the reviewer for highlighting the technical strength of our work.

      Weaknesses:

      One notable limitation, however, is the absence of interrogation of endocytic proteins previously suggested to be recruited in an activity-dependent manner, in particular, endophilin.

      We thank the reviewer for the suggestion. We have added experiments to assess the localization of two more proteins at Drosophila NMJs. These proteins are EndoA and Dap160, both of which have been reported to traffic between the synaptic vesicle cloud and the plasma membrane in response to stimulation [1-3]. In line with these studies, we observed that EndoA and Dap160 partially co-localize with a synaptic vesicle marker and with a periactive zone marker, indicating localization to both compartments (Fig. 2). However, neither high frequency stimulation nor expression of TeNT changed the levels or the distribution of these two proteins at the periactive zone (Fig. 3). Similarly, the deployment of these proteins at the periactive zone at the Drosophila NMJ was not dependent on the active zone scaffold Liprin-α (Fig. 8). Our data indicate that deployment of EndoA and Dap160 to the periactive zone does not require evoked synaptic activity.

      We believe that there are multiple plausible explanations for our findings compared to previous work on Endophilin, which we discuss on lines 407-410: “Increased synaptic enrichment was also observed for Endophilin at nematode NMJs in mutants with disrupted exocytosis (Bai et al., 2010). We do not see such large shifts in Endophilin following similar manipulations, which might reflect distinct synaptic architectures in the C. elegans dorsal cord versus Drosophila NMJ terminals.” Further, this study finds that a plasma membrane-tethered Endophilin strongly colocalizes with endocytic machinery and largely rescues function. This suggests that the plasma membrane is the primary functional compartment for Endophilin. Together with our work, these data suggest that Endophilin constitutively, but not completely, localizes to the periactive zone.

      Reviewer #2 (Public review):

      Summary:

      This study examines whether the localization of endocytic proteins to presynaptic periactive zones depends on synaptic activity or active zone scaffolds. Using a combination of genetic and pharmacological perturbations in Drosophila and mouse neurons, the authors show that proteins such as Dynamin, Amphiphysin, AP-180, and others are still recruited to periactive zones even when evoked release or active zone architecture is disrupted. While the results are mostly negative, the study is methodologically solid and contributes to a more nuanced understanding of synaptic vesicle recycling machinery.

      We thank the reviewer for deeming our work solid and for highlighting its importance for the field.

      Strengths:

      (1) The experimental design is careful and systematic, covering both fly and mammalian systems.

      (2) The use of advanced genetic models (e.g., Liprin-α quadruple knockout mice) is a notable strength.

      (3) High-resolution imaging (STED, Airyscan) is well used to assess spatial localization.

      (4) The findings clarify that certain core assumptions - such as strict activity dependence of endocytic recruitment - may not hold universally.

      We thank the reviewer for pointing out these strengths.

      Weaknesses:

      (1) The study would benefit from a clearer positive control to demonstrate activity-dependent recruitment (e.g., Endophilin).

      We have added experiments to measure the localization of Endophilin, a protein previously reported to localize to the synaptic vesicle cloud [1], in Drosophila NMJs (Figs. 2 and 3). We observed that EndoA localized both to the synaptic vesicle cloud and to the periactive zone area. While stimulation did not enhance levels in either compartment, this outcome is not inconsistent with shuttling of protein between compartments during activity. Nevertheless, our data support a model in which EndoA, like the other tested endocytic proteins, is present at the periactive zone at rest.

      (2) The reliance on Tetanus toxin in the Drosophila NMJ experiments in my eyes is a limitation, as it does not block all presynaptic fusion events; this should be discussed more directly.

      We agree with the point of the reviewer. To more directly discuss it, we have included a “Limitations and Outlook” section in the revised version. We state that “conclusions that can be drawn on the roles of spontaneous release in periactive zone assembly remain limited” (lines 514-515). We further state that, while the manipulations that we included result in decreased spontaneous release, “it is possible that the remaining spontaneous release supports periactive zone assembly” (518-519) and that “Future studies might test manipulations with strong effects on miniature release including those affecting SNARE proteins and their regulators, with the caveat that these manipulations might have effects on upstream trafficking and in some cases on cell survival (Kaeser and Regehr, 2014; Santos et al., 2017).” (519-523).

      (3) The potential role of Dynamin in organizing other periactive zone proteins is not addressed and could be an important next step.

      We agree with the reviewer that this is an interesting possibility. On lines 454-455, we make the broad point that “interactions between endocytic proteins may further contribute to the anchoring of this apparatus”, and on lines 459-460, we specifically suggest a role for Dynamin by stating that “perturbing interactions between Dynamin-1 and Endophilin-A1 increases the distance between these proteins (Imoto et al., 2024), suggesting their binding has a scaffolding function.”

      (4) Some small changes in protein levels upon silencing are reported; their biological meaning (e.g., compensation vs. variability) is not fully clarified.

      These changes might include homeostatic adaptations. In the revised version of the manuscript, this is addressed on lines 135-137 and 405-407. We think it is overall difficult to assign biological meaning to small-magnitude changes, and chose to highlight the main point that there are no large-magnitude changes.

      (5) While alternative organizing mechanisms (actin, lipids, adhesion molecules) are mentioned, a more forward-looking discussion of how to test these models would be helpful.

      Following the reviewer’s suggestion, we have added an outlook section to the discussion where we provide suggestions for future studies (lines 510-543).

      (6) The authors should consider including, or at least discussing, a well-established activity-dependent endocytic protein (e.g., Endophilin) as a positive control to help contextualize the negative findings.

      We have included new experiments on EndoA at the fly neuromuscular junction (Fig. 2, Fig. 3, Fig. 8, Fig. 3 – figure supplement 1) and have added appropriate discussion of these findings as outlined above.

      Reviewer #3 (Public review):

      Summary:

      This study examines how synaptic endocytic zones are positioned using a combination of cultured neurons and the Drosophila neuromuscular junction. The authors test whether neuronal activity, active zone assembly, or liprin-α function is required to localize endocytic zone markers, including Dynamin, Amphiphysin, Nervous Wreck, PIPK1γ, and AP-180. None of the manipulations tested caused a coordinated disruption in the localization or abundance of these markers, leading to the conclusion that endocytic zones form independently of synaptic activity and active zone scaffolds.

      We thank the reviewer for reviewing our work.

      Strengths:

      The work is systematic and carefully executed, using multiple manipulations and two complementary model systems. The authors consistently examine multiple molecular markers, strengthening the interpretation that endocytic zone positioning is robust to changes in activity and structural assembly.

      We thank the reviewer for pointing out these strengths.

      Weaknesses:

      The main limitation is that the study does not test whether the methods used are sensitive enough to detect subtle functional disruption, and no condition tested produces clear disorganization of the endocytic zone. As a result, the conclusion that these zones assemble independently is supported by negative data, without a strong positive control for disassembly or mislocalization.

      We are confident that our methods are sensitive enough to detect changes within synaptic compartments. First, for mouse neurons assessed with STED microscopy, we have demonstrated that we can distinguish between the N- and the C-termini of the presynaptic protein Bassoon, which are positioned only a few tens of nanometers apart [4]. We have subsequently been consistently able to resolve the localization of pre- and postsynaptic proteins that also localize a few tens of nanometers apart and have established that genetic manipulations of active zone proteins induce detectable disruptions as assessed by STED microscopy [4-12]. Given that the periactive zone is larger than the distances that we can resolve, we are confident that we can detect changes in this area with enough sensitivity. Second, for Drosophila NMJs, we use a carefully validated workflow that allows assessing the distribution of periactive zone proteins and can detect subtle changes [13]. Unfortunately, there are no known manipulations that lead to periactive zone disassembly that could serve as a positive control, which reflects the little knowledge available in this field. We acknowledge that there may be subtle changes in protein localization that escape the resolution of our microscopy methods or experimental design, but this would not undermine the conclusion that the periactive zone remains assembled across the manipulations that we have tested. Overall, none of the manipulations we test induces a detectable disruption of the periactive zone. Naturally, we cannot exclude milder effects and have added a limitations section to discuss this possibility and some of the subtle changes we observe.

      This paper addresses a longstanding question in synaptic biology and provides a well-supported boundary on the types of mechanisms that are likely to govern endocytic zone localization. The conclusions are well justified by the data, though additional evidence would be needed to define the assembly mechanism itself.

      We thank the reviewer for the support of the conclusion of our study.

      Recommendations for the authors:

      Reviewing Editor Comments:

      This is a rigorous study that, while presenting largely negative data, delimits the processes that control peri-active zone organization. In addition to the interpretive and technical comments below, we encourage the authors to consider extending this study in two areas. First, examining the activity-dependence of Endophilin, and perhaps other factors, being recruited to the PAZ, where previous research has indicated a positive role for activity. Second, further characterization of the role of miniature release events in potentially contributing to PAZ organization. Overall, this was a rigorous and well-executed study.

      We thank the reviewing editor for this positive assessment of our work.

      Reviewer #1 (Recommendations for the authors):

      (1) The rationale for comparing chronic inhibition to acute depolarization could be more clearly articulated. While this approach may be grounded in prior studies, the physiological consequences of chronic silencing differ markedly from those of transient activity, and these distinctions should be more explicitly addressed in the interpretation of results. For example, might lower intensity, chronic stimulation be a better comparison? Since fixation takes place immediately after stimulation, the time window to capture changes in protein recruitment may be curtailed.

      We thank the reviewer for this comment. The introduction of the manuscript now includes a rationale on lines 110-112. By inhibiting evoked synaptic vesicle fusion throughout the lifespan of neurons, we assessed whether this process is necessary for periactive zone assembly and concluded that it is not a requirement. By acutely depolarizing neurons with 50 mM KCl or with a 40 Hz train of action potentials, we were able to test whether synaptic vesicle fusion triggers the rapid recruitment of endocytic proteins to the periactive zone and concluded that this is not the case for most of the endocytic proteins that we studied. While these results indicate that a constitutive pathway must exist to assemble the periactive zone, we remain agnostic as to whether stimulation paradigms not tested in our study can enhance the deployment of endocytic proteins, especially over long periods of time. This may be the case for low, chronic stimulation, as suggested by the reviewer. We clarify these limitations in a “limitations and outlook” section of the discussion (lines 510-543).

      (2) Amphiphysin stood out as the only protein showing a notable change in opposite directions under either active zone protein knockout/blockers and Liprin-α knockout. Given the predominance of negative results, it would be valuable to devote more discussion to why Amphiphysin behaves differently. What functional role might it play in this context that sets it apart from other endocytic components?

      As suggested by the reviewer, we have extended the discussion on Amphiphysin. One possibility why Amphiphysin may respond differently to different genetic manipulations or changes in stimulation is that different endocytic proteins might belong to different endocytic submachineries. This is addressed on lines 421-424. On lines 444-449, we further discuss the subtle decrease in the levels of Amphiphysin and AP-180 in Liprin-α mutants. We suggest that the actin cytoskeleton may be the link between the active zone and the endocytic apparatus, and that this link may be partially disrupted in Liprin-α mutants. Overall, we note that Amphiphysin is still localized to the periactive zone at rest, and hence that it fits with the overall model of constitutive deployment that we propose.

      (3) The claim of activity-independence may need to be nuanced. Although the data suggest no recruitment in response to acute stimulation, the subtle changes following chronic inhibition complicate this interpretation, especially when considering redundancy. If activity-dependence is considered bidirectional, these findings might reflect a more complex regulatory mechanism. The interpretation in lines 188-190 more accurately captures this complexity than earlier generalizations.

      We agree with the reviewer that the dependence on activity should be discussed in a nuanced fashion. We have scrutinized the manuscript on this point and state throughout that recruitment is independent of evoked activity and not necessarily of any kind of activity. We believe that this interpretation is accurate because evoked release of neurotransmitter was ablated by the pharmacological and genetic manipulations that we used. Furthermore, we have included a “Limitations of the study” section in the discussion where we openly address that spontaneous fusion of synaptic vesicles cannot be ruled out as a potential mechanism to sustain periactive zone assembly (lines 514-523). Finally, we have expanded on the complexity of periactive zone assembly relative to activity. In particular, homeostasis may contribute to increased levels of endocytic proteins upon chronic blockade of evoked transmission (lines 404-406).

      (4) Given published work on endophilin's role in activity-dependent endocytic recruitment, adding endophilin (at least in the Drosophila NMJ experiments) would be highly informative.

      We thank the reviewer for the suggestion. We have added experiments to assess the localization of two more proteins at Drosophila NMJs. These proteins are EndoA and Dap160, both of which have been reported to traffic between the synaptic vesicle cloud and the plasma membrane in response to stimulation [1-3]. In line with these studies, we observed that EndoA and Dap160 partially co-localize with a synaptic vesicle marker and with a periactive zone marker, indicating localization to both compartments (Fig. 2). However, neither high frequency stimulation nor expression of TeNT changed the levels or the distribution of these two proteins at the periactive zone (Fig. 3). Similarly, the deployment of these proteins at the periactive zone at the Drosophila NMJ was not dependent on the active zone scaffold Liprin-α (Fig. 8). Our data indicate that deployment of EndoA and Dap160 to the periactive zone does not require evoked synaptic activity.

      We believe that there are multiple plausible explanations for these findings compared to previous work on Endophilin [3], which we discuss on lines 407-410:

      “Increased synaptic enrichment was also observed for Endophilin at nematode NMJs in mutants with disrupted exocytosis (Bai et al., 2010). We do not see such large shifts in Endophilin following similar manipulations, which might reflect distinct synaptic architectures in the C. elegans dorsal cord vs Drosophila NMJ terminals.” Further, this study finds that a plasma membrane-tethered Endophilin strongly colocalizes with endocytic machinery and largely rescues function. This suggests that the plasma membrane is the primary functional compartment for Endophilin. Together, all data are compatible with a model in which Endophilin constitutively, but not completely, localizes to the periactive zone.

      (5) Line 57 might have a typo in the citation.

      We thank the reviewer for pointing this out. The citations now include: Bai et al., 2010; Jiang et al., 2024; Koh et al., 2007; Winther et al., 2013 and Winther et al. 2015. Please note that these two last citations are grouped as Winther et al. 2013, 2015 following our formatting style.

      (6) Line 208 might be missing a citation that justifies parameters.

      In the revision, this information is discussed on lines 222-224, where we cite our prior work describing these data: “Each unit is divided into ‘mesh’ and ‘core’ regions, where the periactive zone mesh is a ~175 nm wide area localized at ~330 nm from the center, and the ‘core’ region is the interior to this mesh (Del Signore et al., 2023)”.

      Reviewer #2 (Recommendations for the authors):

      (1) Please consider including, or at least discussing, a well-established activity-dependent endocytic protein (e.g., Endophilin) as a positive control to help contextualize the negative findings.

      We thank the reviewer for the suggestion. We have added experiments to assess the localization of two more proteins at Drosophila NMJs. These proteins are EndoA and Dap160, both of which have been reported to traffic between the synaptic vesicle cloud and the plasma membrane in response to stimulation [1-3]. In line with these studies, we observed that EndoA and Dap160 partially co-localize with a synaptic vesicle marker and with a periactive zone marker, indicating localization to both compartments (Fig. 2). However, neither high frequency stimulation nor expression of TeNT changed the levels or the distribution of these two proteins at the periactive zone (Fig. 3). Similarly, the deployment of these proteins at the periactive zone at the Drosophila NMJ was not dependent on the active zone scaffold Liprin-α (Fig. 8). Our data indicate that deployment of EndoA and Dap160 to the periactive zone does not require evoked synaptic activity.

      We believe that there are multiple plausible explanations for our findings compared to previous work on Endophilin [3], which we discuss on lines 407-410: “Increased synaptic enrichment was also observed for Endophilin at nematode NMJs in mutants with disrupted exocytosis (Bai et al., 2010). We do not see such large shifts in Endophilin following similar manipulations, which might reflect distinct synaptic architectures in the C. elegans dorsal cord vs Drosophila NMJ terminals.” Further, this study finds that a plasma membrane-tethered Endophilin strongly colocalizes with endocytic machinery and largely rescues function. This suggests that the plasma membrane is the primary functional compartment for Endophilin. Together, all data are consistent with a model in which Endophilin constitutively, but not completely, localizes to the periactive zone.

      (2) Expand the discussion of TeNT's limitations, specifically that it does not block spontaneous fusion or alternative fusion pathways, and consider referencing more stringent tools (e.g., Botulinum toxins or SNARE mutants), even if they weren't used here.

      Following the reviewer’s suggestion, we have included a “Limitations and Outlook” section in the revised version. We state that “conclusions that can be drawn on the roles of spontaneous release in periactive zone assembly remain limited” (lines 514-515). We further state that, while the manipulations that we included result in decreased spontaneous release, “it is possible that the remaining spontaneous release supports periactive zone assembly” (518-519) and that “Future studies might test manipulations with strong effects on miniature release including those affecting SNARE proteins and their regulators, with the caveat that these manipulations might have effects on upstream trafficking and in some cases on cell survival (Kaeser and Regehr, 2014; Santos et al., 2017)” (520-523).

      (3) We encourage the authors to briefly discuss whether Dynamin might contribute to periactive zone structure beyond its role in membrane fission. Loss-of-function data could be particularly informative in future work.

      We agree with the reviewer that this is an interesting possibility. On lines 454-455, we make the broad point that “interactions between endocytic proteins may further contribute to the anchoring of this apparatus”, and on lines 459-460, we specifically suggest a role for Dynamin by stating that “perturbing interactions between Dynamin-1 and Endophilin-A1 increases the distance between these proteins (Imoto et al., 2024), suggesting their binding has a scaffolding function.”

      (4) Clarify the interpretation of increased endocytic protein levels upon chronic silencing - are these interpreted as homeostatic responses or experimental variability?

      We suggest that these changes might include homeostatic adaptations. We note that this increase is of the same magnitude as the increase in active zone proteins following a similar pharmacological manipulation on lines 405-406, where we state that “a mechanism for this effect might be a homeostatic response (Wen and Turrigiano, 2024) similar in magnitude to the increase in active zone protein levels following activity blockade (Held et al., 2020).”

      (5) The Discussion could be strengthened by sketching out more concrete experimental approaches to test candidate mechanisms (e.g., roles for actin, lipids, adhesion molecules) in organizing periactive zones.

      The potential roles of the cell adhesion molecules (lines 430-440), cytoskeleton and lipids (442-452) are addressed in the discussion. Furthermore, following the reviewer’s suggestion, we have added the following statement (lines 541-543): “This work builds a foundation to assess alternative mechanisms and models of periactive zone assembly, including roles of the cytoskeleton, lipids, adhesion molecules, and intrinsic endocytic protein interactions”. We hope that the reviewer agrees that the discussion of our paper is not the right format to provide a concrete experimental plan for future work. In our view, the discussion should put the findings of our experiments in the context of the field.

      Reviewer #3 (Recommendations for the authors):

      (1) At a spine synapse, the endocytic zone is estimated to be between 100-200 nm from the active zone. The focus of the authors' analysis is largely outside of this region (0-150 nm), raising the question of whether the area studied may be outside of the area affected by the manipulations made. While STED systems claim ~80 nm resolution, this is rarely achieved in practice, and the authors do not report the effective resolution of their system. Reporting the resolution achieved would address this issue. In addition, super-resolution imaging does not appear to have been used at the Drosophila NMJ. The authors should clarify whether resolution limitations influenced the choice of analysis region and whether their imaging approach is sufficient to detect changes in the endocytic zone.

      We believe that it is unlikely that the relevant signals were missed. First, in mouse synapses, most signal corresponding to endocytic proteins was detected inside the selected region of interest. Our rationale to select the area was based on the fact that expanding the region analyzed would have reduced the sensitivity of our approach, as averaging over a larger area would dilute the signal. The resolution of our microscopy should not be a limitation either. In our previous work, we demonstrated that STED microscopy allows discriminating between the N- and C-termini of the presynaptic scaffold Bassoon, which are positioned only a few tens of nanometers apart [4]. This establishes that we can resolve differences at tens of nanometers in a biological context, which is more relevant than the resolution measured with fluorescent beads (which we have repeatedly assessed to be ~80 nm laterally). Subsequently, we have also been consistently able to resolve the localization of pre- and postsynaptic proteins that also localize a few tens of nanometers apart [4-12]. Given that the periactive zone spans a larger area than the distances that we can resolve experimentally in the examples above, we are confident that our measurements are sensitive enough to detect changes in this area.

      Second, for Drosophila NMJs, the choice of the region of interest and the overall analysis followed a workflow validated in our previous work [13]. This method analyzes both immediately adjacent and more distant regions from the active zone, and does not exclude any region based on distance from the active zone as described on lines 222-224: “Each unit is divided into ‘mesh’ and ‘core’ regions, where the periactive zone mesh is a ~175 nm wide area localized at ~330 nm from the center, and the ‘core’ region is the interior to this mesh (Del Signore et al., 2023).” In our previous study, we analyzed the distribution of periactive zone proteins at rest with STED microscopy and with Airyscan confocal microscopy. The resolution provided by Airyscan is reported to be ~175 nm in XY and ~400 nm in Z, which is sufficient to assess localization to the periactive zone compartment and is not inferior to imaging methods previously used to report changes in the distribution of endocytic proteins; for examples, see [1,2]. In the revised manuscript, we have added new data measuring the levels and distribution of EndoA and Dap160 using STED microscopy (Figure 3 – figure supplement 1). The results acquired with STED microscopy and with Airyscan confocal microscopy are consistent with one another.

      Overall, the accuracy of the imaging methods and analyses used in this study is sufficient to assess periactive zone structure given its size and organization.

      (2) Interestingly, in a number of cases, the authors observe significant differences in endocytic markers (Figure 1q, 4k, 6k, 6r). However, little is made of these differences. The authors should provide more discussion of these changes and how they make sense of them alongside their claims of a lack of effect from their manipulations.

      The reviewer raises a good point. We interpret these changes in two different ways. First, we suggest that changes observed in response to block of action potentials or disassembly of the active zone might be homeostatic. This is addressed on lines 135-137. Second, we discuss that the actin cytoskeleton may be the link between the active zone and the endocytic apparatus. Several active zone proteins interact with the actin cytoskeleton. One of them is Liprin-α. This interaction may explain the decrease in the level of Amphiphysin and AP-180 at the periactive zone in Liprin-α null neurons. This is addressed on lines 444-449. We hope that the reviewer agrees that overall, we should focus on the main conclusion that deployment of endocytic proteins persists over a number of manipulations and synapse types.

      (3) The graphs in Figure 1c and 1g, 3g, 4c, 4e, 6c, and 6g do not appear to be identical. If the solid line represents the mean and the lighter color represents the distribution of these data, these data appear to be different from one another. It is surprising that these differences are not significant. What statistical tests were used to determine whether the differences in these graphs are not significant? Is the issue that a relatively low number of synapses were examined (30-60)? Did the authors conduct a power analysis?

      We apologize if the display of our data and analyses was not clear. We do not perform statistical analyses on the line profiles. Instead, we perform it on two values that are extracted from line profiles. These values are (1) the distance between the peak intensity values of the protein of interest and the marker and (2) the peak intensity values. For example, in Figure 1, distances are quantified and statistically analyzed in panel j, and the peak levels are quantified and statistically analyzed in panel k. We have clarified this in the legend of current Figures 1, 4, 5, and 7.
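
      The reduction of each pair of line profiles to two scalar values can be sketched as follows. This is only a minimal illustration, not the authors' analysis code: the function names, the 20 nm sampling step, the example intensities, and the use of an absolute (unsigned) distance are all assumptions made for the example.

```python
from typing import Dict, Sequence, Tuple


def peak(profile: Sequence[float], positions: Sequence[float]) -> Tuple[float, float]:
    """Return (position, intensity) at the maximum of a line profile."""
    idx = max(range(len(profile)), key=lambda i: profile[i])
    return positions[idx], profile[idx]


def profile_metrics(
    protein: Sequence[float],
    marker: Sequence[float],
    positions: Sequence[float],
) -> Dict[str, float]:
    """Reduce two line profiles to the two values that are compared statistically:
    (1) the distance between the two intensity peaks and
    (2) the peak intensity of the protein of interest."""
    protein_pos, protein_peak = peak(protein, positions)
    marker_pos, _ = peak(marker, positions)
    return {
        # Assumption: unsigned distance; a signed distance may be preferable
        # when the orientation of the profile is meaningful.
        "peak_distance": abs(protein_pos - marker_pos),
        "peak_intensity": protein_peak,
    }


# Hypothetical profiles sampled every 20 nm (illustrative values only).
positions = [i * 20 for i in range(8)]
marker = [1, 2, 5, 9, 5, 2, 1, 1]
protein = [1, 1, 2, 3, 4, 7, 4, 2]
metrics = profile_metrics(protein, marker, positions)
```

      With one such pair of values per synapse, group comparisons are run on the per-synapse distances and peak levels rather than on the averaged curves, which is why visually different mean profiles need not correspond to a significant difference.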

      (4) The authors clearly state that their experiments address the role of evoked activity in endocytic zone positioning, but they do not examine whether spontaneous vesicle fusion might play a role. Given the availability of Drosophila mutants that decrease (Doc2, Dunc-13) or increase (syt1) spontaneous release, this is a notable omission. Ideally, these mutants should be examined. And at a minimum, the authors should discuss whether spontaneous release could contribute to endocytic zone organization.

      We agree with the reviewer that spontaneous fusion of synaptic vesicles may contribute to periactive zone organization. Many of the genetic manipulations that we used in mouse neurons result in a significant decrease in spontaneous release. This includes Ca<sub>V</sub>2 triple knockouts with a ~60% decrease in spontaneous fusion [10], RIM+ELKS quadruple knockouts with a ~70% decrease in spontaneous fusion [9] and Liprin-α quadruple knockouts with a ~50% decrease in spontaneous fusion [7]. We cannot rule out that the spontaneous release that is left is sufficient to mediate assembly functions. The conclusive way to address this possibility is using a manipulation that ablates spontaneous release without altering other pathways. However, to our knowledge, this is not available. The manipulations suggested by the reviewer might suffer from similar limitations, as they would change the frequency of spontaneous release without fully ablating it, and they would also affect evoked release. We have included a limitations section in the discussion where we address this (lines 514-523), specifically stating “conclusions that can be drawn on the roles of spontaneous release in periactive zone assembly remain limited. While many of the manipulations used here, including Ca<sub>V</sub>2 knockout (Held et al., 2020), RIM+ELKS knockout (Tan et al., 2022; Wang et al., 2016) and Liprin-α knockout (Emperador-Melero et al., 2024) in hippocampal neurons, and TeNT expression in fly NMJs (Sweeney et al., 1995), result in 50% to 70% decreased spontaneous release rates, it is possible that the remaining spontaneous release supports periactive zone assembly.
Future studies might test manipulations with strong effects on miniature release including those affecting SNARE proteins and their regulators, with the caveat that these manipulations might have effects on upstream trafficking and in some cases on cell survival (Kaeser and Regehr, 2014; Santos et al., 2017).” We hope that the reviewer agrees that assessing these mutants should be a topic of future studies, given that we already test many mutants in the paper.

      (5) In Figures 1 and 6, the authors assess presynaptic protein localization in cultured neurons, but it is unclear whether these are synaptic sites. Many presynaptic proteins traffic together and can accumulate at sites lacking postsynaptic specializations. The authors should validate that the observed spatial organization occurs at bona fide synapses, ideally by co-labeling with postsynaptic markers as done in Figure 4. If methods like these were used, providing more details on how synapses were identified and selected would be useful to the reader.

      While we understand the reviewer’s point, we are confident that the structures analyzed are bona fide synapses for three reasons, as we have established before across many papers [4-8,10-12,17].

      The diameter of the structures detected using the synaptic vesicle marker Synaptophysin aligns much more closely with the size of the large vesicle clusters found at presynaptic terminals than with that of a few transport vesicles.

      In side-view synapses, the bar-like distribution of the active zone marker (Bassoon or Munc13-1) at one edge of the vesicle cloud indicates that active zone proteins are organized at one edge of the vesicle cluster—consistent with the architecture of synapses.

      Synaptophysin is one of our key markers for detecting synapses. In our cultures, most of the Synaptophysin signal colocalizes with postsynaptic markers (either PSD-95 or Gephyrin), as we have established across many studies [4,7-12]. This indicates that the markers used here are sufficient to select synapses. Furthermore, the frequency at which synapses were identified using an active zone marker as the second marker was similar to that observed when using a postsynaptic marker, suggesting that we were not randomly including unrelated structures.

      (6) Many of the images, particularly of the Drosophila NMJ, are of low quality and are shown in very small images. In addition, the quality of the images throughout the paper makes it difficult to assess the authors' analysis and results. The authors should provide larger, higher-quality images that show examples of the means for each of the examples shown. This is an issue for most of the figures, but is particularly prominent in the dNMJ. A minor additional point is that the authors should be clear whether the dNMJ images are collected at super-resolution or using a conventional microscope.

      We believe that the quality of our images is sufficient for the assessments made for the following reasons:

      These images were acquired with enough spatial resolution to assess levels at the PAZ as discussed in response to this reviewer’s first comment. In our previous work, we used images acquired at the same resolution and presented in the same manner for both mouse hippocampal synapses [6,7] and Drosophila NMJs [13,18]. In those previous studies, we drew conclusions at a similar level of detail as in the current study.

      In our view, our representative images are not inferior in quality to other papers in the field addressing similar questions [1,2,19,20].

      We have selected sample images based on the quantified mean values per condition. Hence, we strived to select panels that are objectively representative regarding the quantified parameters.

      We have specified microscopy methods in the figure legends. Specifically, for Drosophila NMJs, we used Airyscan confocal microscopy and STED microscopy. For each experiment, it is now stated which microscopy method was used in the corresponding legend.

      References:

      (1) Winther, Å. M. E. et al. An Endocytic Scaffolding Protein together with Synapsin Regulates Synaptic Vesicle Clustering in the Drosophila Neuromuscular Junction. J Neurosci 35, 14756–14770 (2015).

      (2) Winther, Å. M. E. et al. The dynamin-binding domains of Dap160/intersectin affect bulk membrane retrieval in synapses. J Cell Sci 126, 1021–1031 (2013).

      (3) Bai, J., Hu, Z., Dittman, J. S., Pym, E. C. G. & Kaplan, J. M. Endophilin functions as a membrane-bending molecule and is delivered to endocytic zones by exocytosis. Cell 143, 430–441 (2010).

      (4) Wong, M. Y. et al. Liprin-alpha3 controls vesicle docking and exocytosis at the active zone of hippocampal synapses. Proc Natl Acad Sci U S A 115, 2234–2239 (2018).

      (5) Emperador-Melero, J., de Nola, G. & Kaeser, P. S. Intact synapse structure and function after combined knockout of PTPδ, PTPσ, and LAR. Elife 10, (2021).

      (6) Emperador-Melero, J. et al. PKC-phosphorylation of Liprin-α3 triggers phase separation and controls presynaptic active zone structure. Nat Commun 12, 3057 (2021).

      (7) Emperador-Melero, J. et al. Distinct active zone protein machineries mediate Ca2+ channel clustering and vesicle priming at hippocampal synapses. Nature Neuroscience 2024 1–15 (2024) doi:10.1038/s41593-024-01720-5.

      (8) Tan, C., Wang, S. S. H., de Nola, G. & Kaeser, P. S. Rebuilding essential active zone functions within a synapse. Neuron 110, 1498-1515.e8 (2022).

      (9) Wang, S. S. H. et al. Fusion Competent Synaptic Vesicles Persist upon Active Zone Disruption and Loss of Vesicle Docking. Neuron 91, 777–791 (2016).

      (10) Held, R. G. et al. Synapse and Active Zone Assembly in the Absence of Presynaptic Ca(2+) Channels and Ca(2+) Entry. Neuron 107, 667-683.e9 (2020).

      (11) Chin, M. & Kaeser, P. S. The intracellular C-terminus confers compartment-specific targeting of voltage-gated calcium channels. Cell Rep 43, 114428 (2024).

      (12) Nyitrai, H., Wang, S. S. H. & Kaeser, P. S. ELKS1 Captures Rab6-Marked Vesicular Cargo in Presynaptic Nerve Terminals. Cell Rep 31, 107712 (2020).

      (13) Del Signore, S. J., Mitzner, M. G., Silveira, A. M., Fai, T. G. & Rodal, A. A. An approach for quantitative mapping of synaptic periactive zone architecture and organization. Mol Biol Cell 34, (2023).

      (14) Sweeney, S. T., Broadie, K., Keane, J., Niemann, H. & O’Kane, C. J. Targeted expression of tetanus toxin light chain in Drosophila specifically eliminates synaptic transmission and causes behavioral defects. Neuron 14, 341–351 (1995).

      (15) Kaeser, P. S. & Regehr, W. G. Molecular mechanisms for synchronous, asynchronous, and spontaneous neurotransmitter release. Annu Rev Physiol 76, 333–363 (2014).

      (16) Santos, T. C., Wierda, K., Broeke, J. H., Toonen, R. F. & Verhage, M. Early Golgi Abnormalities and Neurodegeneration upon Loss of Presynaptic Proteins Munc18-1, Syntaxin-1, or SNAP-25. Journal of Neuroscience 37, 4525–4539 (2017).

      (17) de Jong, A. P. H. et al. RIM C2B Domains Target Presynaptic Active Zone Functions to PIP2-Containing Membranes. Neuron 98, 335-349.e7 (2018).

      (18) Del Signore, S. J. et al. An autoinhibitory clamp of actin assembly constrains and directs synaptic endocytosis. Elife 10, (2021).

      (19) Imoto, Y. et al. Dynamin 1xA interacts with Endophilin A1 via its spliced long C-terminus for ultrafast endocytosis. EMBO Journal https://doi.org/10.1038/S44318-024-00145-X

      (20) Imoto, Y. et al. Dynamin is primed at endocytic sites for ultrafast endocytosis. Neuron 110, 2815-2835.e13 (2022).

    1. eLife Assessment

      This potentially useful manuscript addresses the 3D chromatin architecture in monocytes from a few patients with alcohol-associated hepatitis and its relationship to enhanced transcription of innate immune genes. While the concept and methodological approach are interesting in principle, the evidence is incomplete as a result of insufficient sample sizes as well as other substantive analytical concerns.

    2. Reviewer #3 (Public review):

      In this manuscript, the authors use HiC to study the 3D genome of CD14+ CD16+ monocytes from the blood of healthy and those from patients with Alcohol-associated Hepatitis.

      Overall, the authors perform a cursory analysis of the HiC data and conclude that there are a large number of changes in 3D genome architecture between healthy and AH patient monocytes. They highlight some specific examples that are linked to changes in gene expression. The analysis is of such a preliminary nature that I would usually expect to see the data from all figures in just one or two figures.

      In addition, I have a number of concerns regarding the experimental design and the depth of the analyses performed that I think must be addressed.

      (1) There is a myriad of literature that describes the existence of cell-type-specific 3D genome architecture. In this manuscript, there is an assumption by the authors that the CD14+ CD16+ monocytes represent the same population from both the healthy and diseased patients. Therefore, the authors conclude that the differences they see in the HiC data are due to disease-related changes in the equivalent cell types. However, I am concerned that the AH patient monocytes may have differentiated due to their environment so that they are in fact akin to a different cell type and the 3D genome changes they describe reflect this. This is supported by published articles, for example: Dhanda et al., Intermediate Monocytes in Acute Alcoholic Hepatitis Are Functionally Activated and Induce IL-17 Expression in CD4+ T Cells. J Immunol (2019) 203 (12): 3190-3198, in which they show an increased frequency of CD14+ CD16+ intermediate monocytes in AH patients that are functionally distinct.

      I suggest that if the authors would like to study the specific effects of AH on 3D genome architecture then they should carefully FACsort the equivalent monocyte populations from the healthy and AH patients.

      (2) The analysis of the HiC data is quite preliminary. In the 3D genome field, it is usual to report the different scales of genome architecture, for example, compartments, topologically associated domains (TADs) and loops. I think that reporting this information and how it changes in AH patients in the appropriate cell types would be of great interest to the field.

      Comments on revisions:

      In the revision, the authors did not respond to my concerns, which I believe remain valid and compromise the authors' conclusions of AH-specific effects on genome architecture.

    3. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      The authors investigate the relationship between 3D chromatin architecture and innate immune gene regulation in monocytes from patients with alcohol-associated hepatitis (AH). Using Hi-C technology, they attempt to identify structural changes in the genome that correlate with altered gene expression. Their central claim is that genome restructuring contributes to the hyper-inflammatory phenotype associated with AH.

      Strengths:

      (1) The manuscript employs Hi-C technology, which, in principle, is a powerful approach for studying genome organization.

      (2) The focus on disease-relevant genes, particularly innate immune loci, provides a contextually important angle for understanding AH.

      Weaknesses:

      (1) Sample Size: The study relies on an exceptionally small cohort (4 AH patients and 4 healthy controls), rendering the results statistically underpowered and highly susceptible to variability.

      (2) Hi-C Resolution unpaired to RNA seq: The data are presented at a resolution of 100kb, which is insufficient to uncover meaningful chromatin interactions at the level of individual genes. This data is unpaired.

      (3) Functional Validation: The manuscript lacks experiments to directly link changes in chromatin architecture with gene expression or monocyte function, leaving the claims speculative.

      (4) Data Integration: The lack of Hi-C with ATAC and RNA-seq data handicaps the analysis and really makes it superficial. In short, it does not convincingly demonstrate a functional relationship.

      (5) Confounding Factors: The manuscript neglects critical confounding variables such as comorbidities, medications, and lifestyle factors, which could influence chromatin structure and gene expression independently of AH.

      Appraisal of the Aims and Results:

      The manuscript sets out to establish a connection between chromatin architecture and AH pathology. However, the study fails to achieve its stated aims due to inadequate methods and insufficient data. The conclusions drawn from the Hi-C analyses alone are poorly supported, and the lack of functional validation undermines the credibility of the proposed mechanisms. Overall, the results do not provide compelling evidence to substantiate the authors' claims.

      Impact on the Field and Utility to the Community:

      The work, in its current form, is unlikely to have a meaningful impact on the field. The limited scope, methodological shortcomings, and lack of robust data significantly diminish its potential utility. Without addressing these critical gaps, the study does not offer new insights into the role of genome architecture in AH or provide useful methodologies or datasets for the community.

      Additional Context:

      The manuscript would benefit from a more comprehensive analysis of potential mechanisms underlying the observed changes, including the interplay between chromatin architecture and epigenetic modifications. Furthermore, longitudinal studies or therapeutic interventions could provide insights into the dynamic aspects of genome restructuring in AH. These considerations are entirely absent from the current study.

      Conclusion:

      The manuscript does not achieve its stated goals and does not present sufficient evidence to support its conclusions. The limitations in sample size, resolution, and experimental rigor severely hinder its contribution to the field. Addressing these fundamental flaws will be essential for the work to be considered a meaningful addition to the literature.

      Reviewer #2 (Public review):

      Summary:

      Dr. Adam Kim and collaborators study the changes in chromatin structure in monocytes obtained from alcohol-associated hepatitis (AH) when compared to healthy controls (HC). Through the usage of high throughput chromatin conformation capture technology (Hi-C), they collected data on contact frequencies between both contiguous and distal DNA windows (100 kB each); mainly within the same chromosome. From the analyses of those data in the two cohorts under analysis, authors describe frequent pairs of regions subject to significant changes in contact frequency across cohorts. Their accumulation onto specific regions of the genome -referred to as hotspots- motivated authors to narrow down their analyses to these disease-associated regions, in many of which, authors claim, a number of key innate immune genes can be found. Ultimately, the authors try to draw a link between the changes observed in chromatin architecture in some of these hotspots and the differential co-expression of the genes lying within those regions, as ascertained in previous single-cell transcriptomic analyses.

      Strengths:

      The main strength of this paper lies in the generation of Hi-C data from patients, a valuable asset that, as the authors emphasize, offers critical insights into the role of chromatin architecture dysregulation in the pathogenesis of alcohol-associated hepatitis (AH). If confirmed, the reported findings have the potential to highlight an important, yet overlooked, aspect of cellular dysregulation (chromatin conformation changes), not only in AH but potentially in other immune-related conditions with a component of pathological inflammation.

      Weaknesses:

      The two most important weaknesses of the work are, in my view, more methodological than conceptual. The first concerns the insufficient description of how some key types of genomic regions are defined, such as topologically associated domains, DNA hotspots, or even DNA loci showing significant changes in contact frequency between AH and HC. In spite of the importance of these concepts in the paper, no operational, explicit description of how they are defined, from a statistical point of view, is provided in the current version of the manuscript.

      Without these definitions, some of the claims that authors make in their work become hard to sustain. Some examples are the claim that randomizing samples does not lead to significant differences between cohorts; the claim that most of the changes in contact frequency happen locally; or the claim that most changes do not alter the structure of TADs, but appear either within, or between TADs. In my viewpoint, specific descriptions and implementation of proper tests to check these hypotheses and back up the mentioned specific claims, along with the inclusion of explicit results on these matters, would contribute very significantly to strengthening the overall message of the paper.

      The second notable weakness of the study pertains to the characterization of the changes observed around immune genes in relation to genome-wide expectations. Although the authors suggest that certain hotspots contain a high number of immune-related genes, no enrichment analysis is provided to verify whether these regions indeed harbor a higher concentration of such genes compared to other genomic areas. It would be important for readers to be promptly informed if no such enrichment is observed, for in that case, the presence of some immune genes within these hotspots would carry more limited implications.

      Additionally, the criteria used to define a hotspot are not clearly outlined, making it difficult to assess whether the changes in contact frequencies around the immune genes highlighted in figures 5-8 are truly more pronounced than what would be expected genome-wide.

      Reviewer #3 (Public review):

      In this manuscript, the authors use HiC to study the 3D genome of CD14+ CD16+ monocytes from the blood of healthy and those from patients with Alcohol-associated Hepatitis.

      Overall, the authors perform a cursory analysis of the HiC data and conclude that there are a large number of changes in 3D genome architecture between healthy and AH patient monocytes. They highlight some specific examples that are linked to changes in gene expression. The analysis is of such a preliminary nature that I would usually expect to see the data from all figures in just one or two figures.

      In addition, I have a number of concerns regarding the experimental design and the depth of the analyses performed that I think must be addressed.

      (1) There is a myriad of literature that describes the existence of cell type-specific 3D genome architecture. In this manuscript, there is an assumption by the authors that the CD14+ CD16+ monocytes represent the same population from both healthy and diseased patients. Therefore, the authors conclude that the differences they see in the HiC data are due to disease-related changes in the equivalent cell types. However, I am concerned that the AH patient monocytes may have differentiated due to their environment so that they are in fact akin to a different cell type and the 3D genome changes they describe reflect this. This is supported by published articles for example: Dhanda et al., Intermediate Monocytes in Acute Alcoholic Hepatitis Are Functionally Activated and Induce IL-17 Expression in CD4+ T Cells. J Immunol (2019) 203 (12): 3190-3198, in which they show an increased frequency of CD14+ CD16+ intermediate monocytes in AH patients that are functionally distinct.

      I suggest that if the authors would like to study the specific effects of AH on 3D genome architecture then they should carefully FACsort the equivalent monocyte populations from the healthy and AH patients.

      (2) The analysis of the HiC data is quite preliminary. In the 3D genome field, it is usual to report the different scales of genome architecture, for example, compartments, topologically associated domains (TADs), and loops. I think that reporting this information and how it changes in AH patients in the appropriate cell types would be of great interest to the field.

      We thank the reviewers for their careful and thorough examination of our manuscript. We agree with all of their comments regarding the limitations of the study. Many of the criticisms focus on the small sample size of our study (n=4 for healthy controls and disease patients) in both Hi-C and single-cell RNA-seq experiments, and that these experiments are unpaired, or in other words, PBMCs came from different patients for each experiment.

      Unfortunately, these experiments are fairly complicated to perform, requiring patient cells and very expensive deep sequencing. We are not currently in a position to increase the sample size easily or cost-effectively. In the case of Hi-C, we still believe our study to be of value, as Hi-C is not a commonly used technique to study disease effects on chromatin, and very few studies have employed a large enough sample size to perform statistical comparisons. Additionally, analyzing the data at a higher resolution would require deeper sequencing, and unfortunately we do not have the resources to sequence these libraries more deeply. Regarding the single-cell RNA-seq data, this dataset was generated for an earlier study [1] focusing on gene expression responses to LPS, and we were unable to get PBMCs from exactly the same patients to perform the Hi-C study.

      We disagree that our study has limited scientific value. Our study is the first to use Hi-C to show that the 3D genome architecture of primary monocytes is changed in a disease context. The only other study to follow a similar approach performed Hi-C in monocytes from 2 healthy individuals and 2 systemic lupus erythematosus (SLE) patients, and in that study the data from both patients were combined prior to comparison. No statistics were performed, and the authors concluded that there were no differences in genome architecture due to disease. They did find differences between primary monocytes and the THP1 monocytic cell line, but this comparison lacked statistical analysis. Their conclusion was that inflammatory disease may not lead to genome-wide changes in architecture. Our study, though of a very different disease than SLE, shows statistically significant differences between AH and healthy controls. We believe our study lays the groundwork for how Hi-C can be used to study genome architecture in human disease, and the possible downstream effects.

      Confounding Factors: The manuscript neglects critical confounding variables such as comorbidities, medications, and lifestyle factors, which could influence chromatin structure and gene expression independently of AH.

      This is an interesting suggestion. This dataset only contains 4 AH patients, for whom we have included basic clinical data in Supplemental Table 1, including Age, HCA1c, Bilirubin, AST, ALT, Creatinine, Albumin, and MELD score. Three of the four patients have severe AH, while one (AH2) has moderate AH. Despite one patient being moderate, all four AH patients had similar correlations with each other, suggesting that the disease-specific differences we observed are not indicative of severity. More patient samples are needed to determine if genome architecture changes throughout disease progression. We have added this important discussion to the manuscript (page 12, lines 5-14).

      Recommendations for the authors:

      Reviewer #2 (Recommendations for the authors):

      The criteria used to determine which pairs of regions exhibit significant differences in contact frequency between alcohol-associated hepatitis (AH) and healthy controls (HC) are not disclosed. It would be beneficial for the authors to provide this information, including details such as the number of pairs tested, the nature of the statistical tests conducted, the method of multiple testing correction applied, as well as the significance thresholds used, and the number of loci-pairs below these thresholds for each chromosome. This information would greatly enhance the reader's understanding of the relevance of the reported findings.

      Thank you for this comment, though we are not sure we totally understand. All of our statistics were performed using multiHiCcompare [2], where we input all 8 datasets (.hic files from Juicer), then measured statistical differences between defined groups (HC vs AH). For our randomization studies, we randomized the group comparisons, so each group contained a mix of HC and AH.

      Second, a formal statistical definition of what constitutes a hotspot would be valuable for clarity.

      Thank you for this suggestion. Initially, hotspots were defined simply as regions of the genome with a high frequency of highly significant differential contacts. We have now adopted a more formal definition of “hotspot” based on similar criteria. A hotspot is defined by both adjusted p value and frequency of locations. First, we filtered all pair-wise chromosomal interactions using a very stringent threshold (padj < 0.0000001) to focus on only the most changed coordinates (Supplemental Table 4). Then we looked for regions of the genome with a high frequency of these differential locations. Borders for each hotspot were determined more liberally by looking at the full list of differential contacts (padj < 0.05). Then we used code to list genes within each interacting region. We have added these important details to the Methods (page 14, lines 11-14).
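
      The two-threshold procedure described in this response could be sketched as below. This is a hedged illustration rather than the authors' code: the two padj cutoffs come from the response, but the minimum hit count (`MIN_HITS`), the rule for extending borders through contiguous bins, the tuple layout of the contact table, and the example data are assumptions introduced here, since the response does not specify what counts as a "high frequency".

```python
from collections import Counter

STRINGENT_PADJ = 1e-7   # from the response: seeds use only the most changed contacts
LIBERAL_PADJ = 0.05     # from the response: borders use the full differential list
MIN_HITS = 3            # hypothetical cutoff; "high frequency" is not quantified
BIN_SIZE = 100_000      # 100 kb bins, matching the resolution discussed in the reviews


def find_seed_bins(contacts, min_hits=MIN_HITS):
    """Bins that take part in at least `min_hits` stringent differential contacts.

    `contacts` is an iterable of (chrom, bin1_start, bin2_start, padj),
    assumed here to be intrachromosomal pairs.
    """
    counts = Counter()
    for chrom, b1, b2, padj in contacts:
        if padj < STRINGENT_PADJ:
            counts[(chrom, b1)] += 1
            if b2 != b1:
                counts[(chrom, b2)] += 1
    return sorted(key for key, n in counts.items() if n >= min_hits)


def extend_borders(seed_bins, contacts, bin_size=BIN_SIZE):
    """Grow each seed through contiguous bins that appear in any liberal contact."""
    liberal = set()
    for chrom, b1, b2, padj in contacts:
        if padj < LIBERAL_PADJ:
            liberal.add((chrom, b1))
            liberal.add((chrom, b2))
    hotspots = set()
    for chrom, start in seed_bins:
        lo = hi = start
        while (chrom, lo - bin_size) in liberal:
            lo -= bin_size
        while (chrom, hi + bin_size) in liberal:
            hi += bin_size
        hotspots.add((chrom, lo, hi + bin_size))  # half-open interval [lo, end)
    return sorted(hotspots)


# Illustrative contact table (made-up values).
contacts = [
    ("chr1", 500_000, 600_000, 1e-9),
    ("chr1", 500_000, 700_000, 1e-8),
    ("chr1", 500_000, 900_000, 1e-10),
    ("chr1", 400_000, 800_000, 0.01),
    ("chr1", 2_000_000, 2_100_000, 0.2),
]
seeds = find_seed_bins(contacts)
hotspots = extend_borders(seeds, contacts)
```

      Making the frequency cutoff an explicit parameter is the point of the sketch: reporting the value used would let readers judge how sensitive the hotspot list is to that choice.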

      Third, a clear definition of the criteria used to identify different topologically associated domains (if these were indeed defined in the data and/or utilized in the analyses) would also be a helpful addition.

      Thank you for this suggestion. We did not identify TADs or use TADs in any of these analyses.

      Likewise, several statements throughout the paper lack support from specific analyses, although it should be feasible to implement such analyses (or at least present them if they have already been conducted) to substantiate these claims:

      If randomizing samples does not result in significant differences between (randomized) cohorts, it would be beneficial to provide insights into the number of loci pairs that exhibit differences in frequency when using both the actual and randomized cohorts.

      Thank you for asking this question, as this is an important point. Using multiHiCcompare, if we compare HC (n=4) to AH (n=4), we get the results in the figures and supplementary data, but if we randomize the groups, comparing Group 1 (HC, HC, AH, AH) vs Group 2 (HC, HC, AH, AH), we get almost 0 significant changes in contact frequency. To show this more robustly, we performed 5 randomized comparisons and found far fewer changes in contact frequency between groups. This shows that the changes in contact frequency we report are not random, but rather reflect real differences between AH and HC. This point has been added to the Results (page 6, lines 15-17) and Methods (page 14, lines 16-21).
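
      For context, the balanced randomizations described here (two healthy-control and two AH samples in each group) can be enumerated exhaustively. The sketch below is an illustration only; the sample labels are hypothetical placeholders and the differential analysis itself (run by the authors with multiHiCcompare) is not reproduced.

```python
from itertools import combinations

# Hypothetical sample labels for 4 healthy controls and 4 AH patients.
hc = ["HC1", "HC2", "HC3", "HC4"]
ah = ["AH1", "AH2", "AH3", "AH4"]


def balanced_splits(hc, ah):
    """Enumerate every way to form two groups that each hold 2 HC and 2 AH samples.

    Swapping the two groups gives the same comparison, so each unordered
    split is reported once.
    """
    samples = set(hc) | set(ah)
    seen = set()
    splits = []
    for hc_pair in combinations(hc, 2):
        for ah_pair in combinations(ah, 2):
            group1 = frozenset(hc_pair + ah_pair)
            group2 = frozenset(samples - group1)
            key = frozenset((group1, group2))
            if key not in seen:
                seen.add(key)
                splits.append((sorted(group1), sorted(group2)))
    return splits


splits = balanced_splits(hc, ah)
print(len(splits))  # number of distinct balanced randomizations
```

      With four samples per cohort there are 18 distinct balanced splits, so the five randomized comparisons reported are a subset; running all of them would give a complete empirical null for the number of significant contacts.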

      If most changes in contact frequency occur locally, it would be useful to visualize the relationship between effect sizes and/or significance levels for the observed differences in frequency in relation to the distance between the involved loci. Additionally, comparing these results to the average baseline contact intensities as a function of distance would be informative. This comparison could help determine whether the distance decay in effect size/significance for the differences between AH and HC is faster or slower than the decay rates for baseline contact frequencies.

      This is a good suggestion. In our initial analysis, we made a number of figures relating chromosome positions, distance between loci, and statistics regarding the differential contact frequency. In the initial submission, we only showed Figure 3, which shows the logFC (log fold change) for the differential contact frequency by chromosomal position on both sides. To address this question, we have added a supplemental figure showing logFC as a function of the distance between two loci (new Supplemental Figure 3).
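
      One simple way to summarize logFC against locus separation is to bin contacts by distance and average the absolute logFC per bin, as in the sketch below. The function name and the 1 Mb bin size are assumptions made for illustration; the response does not say how the new supplemental figure was computed.

```python
from collections import defaultdict
from statistics import mean

def mean_abs_logfc_by_distance(contacts, bin_size=1_000_000):
    """contacts: iterable of (pos1, pos2, logFC) for loci pairs on the same chromosome.

    Returns {distance_bin_start: mean absolute logFC}, ordered by distance.
    """
    by_bin = defaultdict(list)
    for pos1, pos2, logfc in contacts:
        by_bin[abs(pos2 - pos1) // bin_size].append(abs(logfc))
    return {b * bin_size: mean(vals) for b, vals in sorted(by_bin.items())}
```

      The same binning applied to baseline contact counts would give the distance-decay curve the reviewer asks to compare against.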

      Similarly, the assertion that most changes do not affect the structure of topologically associated domains (TADs) but occur either within or between TADs should be supported by specific testing; otherwise, it should be removed.

      Thank you. Yes, we have adjusted the language in the Discussion.

      Furthermore, the authors should clarify whether differences in chromatin conformation are more pronounced around immune genes compared to genome-wide expectations. If this is not the case, it would be helpful to quantify the intensity of these differences around the highlighted genes in relation to the rest of the genome. To achieve this, I would suggest the following:

      Conduct enrichment analyses on the genes located within the most prominent hotspots to determine whether they are significantly enriched in immune genes (and/or, alternatively, in any other functional category).

      Estimate the average absolute fold change in contact frequency within all topologically associated domains (TADs) identified in the study. This would allow for the identification of immune gene-containing TADs highlighted in Figures 5-8, providing readers with a quantitative understanding of how anomalously different these genomic regions are with regard to the magnitude of their alterations in AH, compared to the rest of the genome.

      While some of the selected gene clusters appear to co-localize well with topologically associated domains (e.g., Figures 5A, 8A), others seemingly encompass either multiple TADs (Figure 6) or only portions of them (Figure 7). This should be clarified.

      Thank you, this is a great suggestion. In order to be as unbiased as possible, we took all genes present in the regions with the most significant changes in the genome (Supplemental Table 4) that we used to identify the hotspots. You are correct: we do in fact see enrichment of genes involved in innate immune signaling. This has been added to Results (page 7, lines 19-25) and Figure 4.
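
      The response does not state which enrichment test was used. As one common choice, over-representation of a gene category among hotspot genes can be assessed with a one-sided hypergeometric test, sketched below under that assumption; the function name and argument names are hypothetical.

```python
from math import comb

def hypergeom_enrichment_p(total_genes, category_genes, hotspot_genes, overlap):
    """One-sided hypergeometric p value: P(X >= overlap).

    total_genes:    size of the background gene set
    category_genes: background genes annotated to the category (e.g. innate immunity)
    hotspot_genes:  genes found in the hotspot regions
    overlap:        hotspot genes that are also in the category
    """
    upper = min(category_genes, hotspot_genes)
    tail = sum(
        comb(category_genes, i) * comb(total_genes - category_genes, hotspot_genes - i)
        for i in range(overlap, upper + 1)
    )
    return tail / comb(total_genes, hotspot_genes)
```

      In practice the p values would also be corrected for the number of categories tested.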

      Finally, there are several minor issues concerning the figures that could be easily addressed to substantially enhance their readability:

      Font sizes in most figures should be increased, particularly for some axis labels and tick marks. This issue affects most figures; for instance, in Figure 4, it hinders the reader's ability to interpret the ranges of the data presented.

      Thank you, the figures have been adjusted.

      Figures 5 to 8 (panels A and B) would benefit significantly from a more consistent format. Specifically, the gene cluster boxes should also be included in the right panels, and the gene locations should be displayed on the left in a uniform format across all figures (e.g., formatting Figures 7 and 8 to match the style of Figures 5 and 6).

      Figures 5 and 6 have a similar structure to each other because we were focusing on all of the genes in that chromosomal region. Figures 7 and 8 are different because we are focusing on how the region around a certain hotspot of interest changes.

      It is also important to note that the genes plotted in Figures 8C and 8D are not the same. Concerning these two panels, it would be valuable to clarify whether the data presented pertains exclusively to monocytes. If so, information regarding the number of cells analyzed and the number of donors from which they were drawn would also be beneficial.

      These figures are generated using scRNA-seq data. They represent all of the genes expressed in that region of the genome, in their chromosomal position. If a gene is not expressed in the scRNA-seq data, then it is not shown. I have debated with myself a lot on how to show gene expression in a region of the genome, but I think this is the clearest way to show this; including the genes that have no expression would make it more confusing. But yes, if you compare HC and AH, you see some differences in the list of genes. We have added more clarity to the figure legend for this figure.

      References

      (1) Kim, A., Bellar, A., McMullen, M. R., Li, X. & Nagy, L. E. Functionally Diverse Inflammatory Responses in Peripheral and Liver Monocytes in Alcohol-Associated Hepatitis. Hepatol Commun 4, 1459-1476 (2020). https://doi.org/10.1002/hep4.1563

      (2) Stansfield, J. C., Cresswell, K. G. & Dozmorov, M. G. multiHiCcompare: joint normalization and comparative analysis of complex Hi-C experiments. Bioinformatics 35, 2916-2923 (2019). https://doi.org/10.1093/bioinformatics/btz048

    1. eLife Assessment

      This manuscript presents a valuable antiviral approach using an engineered ACE2-Fc fusion protein that demonstrates broad-spectrum neutralization capacity against SARS-CoV-2 variants and achieves significant prophylactic protection in animal models through a novel Fc-mediated phagocytosis mechanism. The study provides convincing evidence for protective efficacy through rigorous in vivo validation in mice, mechanistic characterization via transcriptomic analysis and biodistribution studies, and demonstration of antibody-dependent cellular phagocytosis as the primary clearance mechanism mediated by the decoy. The work will be of interest to researchers working in vaccine development and associated immune responses.

    2. Reviewer #1 (Public review):

      Summary:

      This manuscript by Wang et al. describes the development of an optimized soluble ACE2-Fc fusion protein, B5-D3, for intranasal prophylaxis against SARS-CoV-2. As shown, B5-D3 conferred protection not only by acting as a neutralizing decoy, but also by redirecting virus-decoy complexes to phagocytic cells for lysosomal degradation. The authors showed complete in vivo protection in K18-hACE2 mice and investigated the underlying mechanism by a combination of Fc-mutant controls, transcriptomics, biodistribution studies, and in vitro assays.

      Strengths:

      The major strength of this work is the identification of a novel antiviral approach with broad-spectrum and beyond simple neutralization. Mutant ACE2 enables broad and potent binding activity with the S proteins of SARS-CoV-2 variants, while the fused Fc part mediates phagocytosis to clear the viral particles. The conceptual advance of this ACE2-Fc combination is convincingly validated by in vivo protection data and by the completely abrogated protection of Fc LALA mutant.

      Additionally:

      The authors include a discussion (in Discussion part) about a previously reported ACE2 decamer (DOI: 10.1080/22221751.2023.2275598) and compared with the ACE2-Fc fusion protein developed in this study. The authors also tested the off-target activity and showed no evidence of toxicity in vivo.

    3. Reviewer #2 (Public review):

      Summary:

      Wang et al. engineered an ACE2 mutant by introducing two mutations (T92Q and H374N), and fused this ACE2 mutant to human IgG1-Fc (B5-D3). Experimental results suggest that B5-D3 exhibits broad-spectrum neutralization capacity and confers effective protection upon intranasal administration in SARS-CoV-2-infected K18-hACE2 mice. Transcriptomic analysis suggests that B5-D3 induces early immune activation in lung tissues of infected mice. Fluorescence-based bio-distribution assay further indicates rapid accumulation of B5-D3 in the respiratory tract, particularly in airway macrophages. Further investigation shows that B5-D3 promotes viral phagocytic clearance by macrophages via an Fc-mediated effector function, namely antibody-dependent cellular phagocytosis (ADCP), while simultaneously blocking ACE2-mediated viral infection in epithelial cells. These results provide some insights into improving decoy treatments against SARS-CoV-2 and other potential respiratory viruses.

      Strengths:

      The protective effect of this ACE2-Fc fusion protein against SARS-CoV-2 infection has been evaluated in a reasonable way.

      Weaknesses:

      (1) Some of the mice experiments suffer from insufficient sample numbers, which affect the statistical power and reliability of the results. The author acknowledged this weakness, noting that the supply of aged mice was limited, while arguing that, although the sample size is small, the data from these mice are consistent.

      (2) Compared to 6 hours, intranasal administration of B5-D3 at 24 hours before viral infection results in reduced protective efficacy. However, only survival and body weight data are provided, with no supporting evidence from virological assays such as viral titer measurement. The author acknowledged that such data would be more comprehensive and attributed the limitation to constraints in animal services.

      (3) The efficacy of the B5-D3-LALA group was not as good as that of the B5-D3 group. The author suggested that there might be a certain degree of viral variation, and viral infection in the lungs may be uneven in the B5-D3-LALA group.

    4. Reviewer #3 (Public review):

      Strengths:

      The core strength of this study lies in its innovative demonstration that an engineered sACE2-Fc fusion redirects virus-decoy complexes to Fc-mediated phagocytosis and lysosomal clearance in macrophages, revealing a distinct antiviral mechanism beyond traditional neutralization. Its complete prophylactic protection in animal models and precise targeting of airway phagocytes establish a novel therapeutic paradigm against SARS-CoV-2 variants and future respiratory viruses.

      Weaknesses:

      The study attributes the complete antiviral protection to Fc-mediated phagocytic clearance, a central claim that requires more rigorous experimental validation. The observation that abrogating Fc functions compromises protection could be confounded by potential alterations in the protein's stability, half-life, or overall structure. To firmly establish this mechanism, it is crucial to include a control molecule with a mutated Fc region that lacks FcγR binding while preserving the Fc structure itself. Without this critical control, the conclusion that phagocytic clearance is the primary mechanism remains inadequately supported.

      The strategy of deliberately targeting virus-decoy complexes to phagocytes via Fc receptors inherently raises the question of Antibody-Dependent Enhancement (ADE) of disease. While the authors demonstrate a lack of productive infection in macrophages, this only addresses one facet of ADE. The risk of Fc-mediated exacerbation of inflammation (ADE) remains a critical concern. The manuscript would be significantly strengthened by a direct discussion of this risk and by including data, such as cytokine profiling from treated macrophages, to more comprehensively address the safety profile of this approach.

      The exclusive use of the K18-hACE2 mouse model, which exhibits severe disease, limits the generalizability of the findings. The "complete protection" observed may not translate to models with more robust and naturalistic immune responses or to human physiology. Furthermore, data against circulating SARS-CoV-2 variants of concern are lacking.

      The concept of sACE2-Fc fusion proteins as decoy receptors is not novel, and numerous similar constructs have been previously reported. The manuscript would benefit from a clearer demonstration of how the optimized B5-D3 mutant represents a significant advance over existing sACE2-Fc designs. A direct comparative analysis with previously published benchmarks, particularly in terms of neutralizing potency, Fc effector function strength, and in vivo efficacy, is necessary to establish the incremental value and novelty of this specific agent.

      Comments on revised version:

      The author has successfully addressed the raised issue.

    5. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      This manuscript by Wang et al. describes the development of an optimized soluble ACE2-Fc fusion protein, B5-D3, for intranasal prophylaxis against SARS-CoV-2. As shown, B5-D3 conferred protection not only by acting as a neutralizing decoy, but also by redirecting virus-decoy complexes to phagocytic cells for lysosomal degradation. The authors showed complete in vivo protection in K18-hACE2 mice and investigated the underlying mechanism by a combination of Fc-mutant controls, transcriptomics, biodistribution studies, and in vitro assays.

      Strengths:

      The major strength of this work is the identification of a novel antiviral approach with broad-spectrum and beyond simple neutralization. Mutant ACE2 enables broad and potent binding activity with the S proteins of SARS-CoV-2 variants, while the fused Fc part mediates phagocytosis to clear the viral particles. The conceptual advance of this ACE2-Fc combination is convincingly validated by in vivo protection data and by the completely abrogated protection of Fc LALA mutant.

      We thank the reviewer for his recognition and positive comments on our study.

      Weaknesses:

      Some aspects could be further modified.

      (1) A previously reported ACE2 decamer (DOI: 10.1080/22221751.2023.2275598) needs to be mentioned and compared in the Discussion part.

      We thank the reviewer for pointing out this weakness.

      Indeed, previous studies reported that the ACE2-IgM decamer, taking advantage of the decameric structure of IgM, exhibited higher avidity to spikes and greater potency for viral neutralization [1-3]. In particular, the study by Guo et al. has demonstrated a broad-spectrum neutralization ability of the ACE2-IgM decamer against multiple SARS-CoV-2 variants and reported the efficacy of intranasal prophylaxis in preventing lethal SARS-CoV-2 challenge in K18-hACE2 mice.

      We agree with the reviewer that our B5-D3 design might benefit from switching to the IgM isotype. However, the distinct biological features imposed by IgM Fc, including short serum half-life and restricted tissue penetration [4], may complicate the study design and divert our focus.

      In our current study, we focused on the IgG1 Fc-based decoy design, while inactivating the enzyme activity of ACE2 to avoid disturbing the renin-angiotensin system. This design allowed us to compare diverse administration routes and regimens and to gain useful insights into the potential of the sACE2-Fc decoy in combating SARS-CoV-2 in vivo.

      We appreciate the reviewer’s insightful suggestion. In the revised manuscript, we have included additional discussion regarding the ACE2-IgM decamer, addressing the relevant concern on page 17 lines 409–414.

      (2) Limitations of this study, such as off-target binding and potential immunogenicity, should also be discussed.

      We thank the reviewer for his insightful comments and agree that off-target activity is a major concern for designing the ACE2 decoy.

      (1) In our study, the representative sACE2-Fc decoy candidate B5-D3 contains the H374N mutation (D3), which is designed to inactivate ACE2 enzyme activity by causing dyscoordination of Zn2+. Our in vitro enzymatic activity assay has demonstrated that the H374N mutation (D3), as well as the three other single mutations D1, D4 and D5, in either WT sACE2-Fc or the B5 mutant, could effectively abolish hACE2 enzyme activity (Supplementary Fig. 2e, h).

      (2) To further address the concern on off-target activity, we performed AAV-based overexpression experiments in K18-hACE2 mice and examined serum levels of RAS hormones, using ELISA methods that specifically detect serum renin, Angiotensin II (Ang II), and Ang (1-7). While our data from WT sACE2-Fc overexpression revealed significantly elevated serum renin and Ang II, indicating a disruption of the RAS (Supplementary Fig. 4d, e), the results from the examined double mutants, including B5-D3, showed negligible change in any of these metabolite levels, demonstrating no off-target effect and minimal disturbance to RAS activity in K18-hACE2 mice (Supplementary Fig. 4d–f).

      (3) Moreover, in this experiment, after the prolonged overexpression of all these molecules in K18-hACE2 mice, histological examination of multiple organs showed no evidence of immune cell infiltration or tissue damage, and no difference was observed between the mice receiving WT sACE2-Fc or B5-D3 (Supplementary Fig. 4g).

      In the revised manuscript, we have included the results from the AAV-delivered in vivo overexpression of WT sACE2-Fc and three most promising double mutants (B5-D3, B5-D4 and B5-D5) on page 5 lines 118–122 and on page 6 lines 123–135 in the main text. The relevant data were presented in the new Supplementary Fig. 4.

      Reviewer #2 (Public review):

      Summary:

      Wang et al. engineered an optimized ACE2 mutant by introducing two mutations (T92Q and H374N) and fused this ACE2 mutant to human IgG1-Fc (B5-D3). Experimental results suggest that B5-D3 exhibits broad-spectrum neutralization capacity and confers effective protection upon intranasal administration in SARS-CoV-2-infected K18-hACE2 mice. Transcriptomic analysis suggests that B5-D3 induces early immune activation in lung tissues of infected mice. Fluorescence-based biodistribution assay further indicates rapid accumulation of B5-D3 in the respiratory tract, particularly in airway macrophages. Further investigation shows that B5-D3 promotes viral phagocytic clearance by macrophages via an Fc-mediated effector function, namely antibody-dependent cellular phagocytosis (ADCP), while simultaneously blocking ACE2-mediated viral infection in epithelial cells. These results provide insights into improving decoy treatments against SARS-CoV-2 and other potential respiratory viruses.

      Strengths:

      The protective effect of this ACE2-Fc fusion protein against SARS-CoV-2 infection has been evaluated in a quite comprehensive way.

      We thank the reviewer for his recognition and positive comments on our study.

      Weaknesses:

      (1) The paper lacks an explanation regarding the reason for the combination of mutations listed in Supplementary Figure 2b. For example, for the mutations that enhance spike protein binding, B2-B6 does not fully align with the mutations listed in Table S1 of Reference 4, yet no specific criteria are provided.

      We thank the reviewer for pointing out this oversight.

      We constructed the B2-B6 mutants based on the study by Chan et al. [5] (Reference 4 in the previous version), mainly referring to their Fig. 1A rather than to their Table S1. In Chan’s study, each of the proposed mutations was discovered as a single mutation in monomeric sACE2 molecules based on enrichment in target cell-binding. T92 was a notable hot spot for enriched mutations in their Fig. 1A.

      Since monomeric and dimeric forms of sACE2 showed dramatically different kinetics for ACE2-RBD interaction, we selected five proposed mutations and further examined their affinity and activity in dimeric sACE2-Fc in our study. We chose not only the combinations of mutations, such as B3, B4, and B6 proposed in their Table S1, but also explored less-complicated mutation(s) like B2 (T27Y/L79T) and B5 (T92Q) in their Fig. 1A, which were in silico predicted to enhance ACE2-RBD binding but not tested in sACE2-Fc in Chan’s study.

      Interestingly, although our results confirmed enhanced viral neutralization by all these mutations, the activity increase compared to WT ACE2-Fc was rather limited. Hence, we chose not to explore other mutations but to focus on B2–B6 to construct an enhanced ACE2-Fc decoy as a representative, to investigate the potential of ACE2-Fc decoys in combating SARS-CoV-2 infections.

      In the revised manuscript, we have further amended the writing on page 4 lines 84–87 to enhance the readability. For conciseness, however, we did not describe in great detail how we selected the mutations to be tested.

      Second, for the mutations that abolished enzymatic activity, while D1 and D2, D3, D4, and D5 are cited from References 12, 11, and 33, respectively, the reason for combining D3 and D4 into A2, and D1 and D2 into A3 remains unexplained. It is also unclear whether some of these other possible combinations have been tested. Furthermore, for the B5-derived mutations, only double-mutant combinations with D1-D5 are tested, with no attempt made to evaluate triple mutations involving A2 or A3.

      We thank the reviewer for pointing out this oversight.

      A2 and A3 mutations were originally proposed as double mutations [6,7]. A2 (H374N/H378N) was first reported by Guy et al. [6] (Reference 11 in the previous version), while A3 (R273G/T445G) was originally proposed in Payandeh et al.’s study [7] (Reference 33 in the previous version).

      In this study, we further split the two mutations in A2 and A3 to generate the single enzyme-deactivating mutations, D1 and D2 from A3, and D3 and D4 from A2. Among these single mutations, D2 failed to inactivate ACE2 enzymatic activity (Supplementary Fig. 2e), and it was excluded from subsequent analyses.

      D5 (H345L) was a single mutation directly adopted from the report by Glasgow et al. [8] (Reference 12 in the previous version).

      After combining B5 with the enzyme-deactivating mutations (A2, A3, D1, D3, D4, D5), our neutralization assay results showed that the simpler compound mutants with only two mutations, like B5-D1, B5-D3, B5-D4 and B5-D5, exhibited stronger neutralization capacity than B5-A2 and B5-A3 with triple mutations. Moreover, since fewer mutations are more favorable for reducing the risks of altering protein structure and evoking host immunity, we then focused on the sACE2-Fc double mutants B5-D3, B5-D4 and B5-D5 in the subsequent neutralization and overexpression assays (Supplementary Fig. 3 and 4), and examined B5-D3 as a representative candidate in the in vivo infection tests and follow-up analysis (Figure 2–6, and Supplementary Figures 5–18).

      We agree that the lack of explanation for splitting A2 and A3 into D1 to D4 single mutations made the rationale unclear. In the revised manuscript, we have included our previous test results on B5-A2 and B5-A3, cited Lei et al.’s study using A2 in ACE2 decoy [9], and explained the rationale for splitting A2 and A3 into D1 to D4 mutations. Relevant revision was made on page 4 lines 94–97 in the main text, while the design and data for B5-A2 and B5-A3 were included in the revised Figure 1b and Supplementary Figure 2b, f–h.

      (2) Figures 1b, 1d, and 1e lack statistical analyses, making it difficult to determine whether B5 and D3 exhibit significant advantages. For Wuhan-Hu-1 strain, B2 and B5 are similar, and for D614G strain, B2, B3, B4, B5, and B6 display comparable results. However, only the glycosylation-related single mutant B5 is chosen for further combinatorial constructs. Moreover, for VOC/VOI strains, B5 is superior to B5-D3; for the Alpha strain, B5-D4 and B5-D5 are superior to B5-D3; and for the Delta and Lambda strains, B5-D5 is superior to B5-D3. These observations further highlight the need for a clearer explanation of the selection strategy.

      We agree with the reviewer’s insightful observations.

      Indeed, although our results confirmed enhanced viral neutralization by these reported mutations, the activity increases compared to WT ACE2-Fc were generally limited. Importantly, these observations were largely consistent with other reports (including the study by Chan et al. [5]), suggesting limited potential of mutagenesis in enhancing the ACE2-RBD/Spike interaction. Therefore, we chose to selectively examine B2-B6 to construct an enhanced ACE2-Fc decoy with reasonable performance, as a representative candidate to study the application potential of ACE2-Fc decoy.

      The IC50 values in Figures 1b, 1d, and 1e were calculated from neutralization curves measuring infection reduction at multiple concentrations in duplicate, and are therefore presented with statistical support. Across the multiple neutralization assays, B5-D3 consistently showed high performance among the top performers (Figure 1, Supplementary Fig. 2f,g, and Supplementary Fig. 3).
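
      For readers unfamiliar with how an IC50 is read off a neutralization curve, the sketch below estimates it by log-linear interpolation between the two concentrations that bracket 50% inhibition. This is a deliberately simplified stand-in: the response does not describe the fitting method, and a dose-response model (for example a four-parameter logistic fit) is the more usual choice. The function name and input format are assumptions.

```python
import math

def ic50_by_interpolation(concs, inhibition):
    """Estimate IC50 from a dose-response series.

    concs:      concentrations in ascending order (all > 0)
    inhibition: percent infection reduction at each concentration,
                assumed to rise with concentration
    Returns None if the series never crosses 50%.
    """
    points = list(zip(concs, inhibition))
    for (c_lo, y_lo), (c_hi, y_hi) in zip(points, points[1:]):
        if y_lo <= 50 <= y_hi and y_hi != y_lo:
            # Interpolate linearly in log10(concentration).
            frac = (50 - y_lo) / (y_hi - y_lo)
            log_ic50 = math.log10(c_lo) + frac * (math.log10(c_hi) - math.log10(c_lo))
            return 10 ** log_ic50
    return None
```

      With duplicate wells, the mean inhibition at each concentration would be passed in, and the spread between duplicates gives a sense of the uncertainty around the estimate.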

      We agree that B2 and B5 performed comparably well in neutralization assays, but B2 contains two mutations (T27Y/T92Q) while B5 carries a single mutation (T92Q). Hence, we decided to focus on B5 due to its lowest mutational burden and least potential risk.

      We agree that for VOC/VOI strains, B5 was superior to B5-D3 in pseudovirus-neutralization assays. However, B5-D3 was enzymatically inactive, which is essential for generating a safe ACE2 decoy and therefore justifies our use of B5-D3 over B5.

      We agree with the reviewer that, altogether, B5-D3 did not show significant advantages over other top performers like B5-D4 and B5-D5. Here, B5-D3 was selected as a representative, which performed equally well rather than being the most outstanding candidate, for subsequent examination of efficacy, safety, and mechanistic insights.

      We thank the reviewer for his valuable feedback. In the revised manuscript, we have further amended our description of B5-D3, as a “representative” candidate, to improve the readability. Relevant changes can be found on page 4 line 84, page 5 line 109, page 14 line 333 and page 15 line 360.

      (3) Figure 1e does not specify the construct form of the control hIgG1, namely whether it is an hIgG1 Fc fragment or a full-length hIgG1 protein. If the full-length form is used, the design of its Fab region should be clarified to ensure the accuracy and comparability of the experimental control.

      We thank the reviewer for pointing out this oversight.

      In this study, we used the in vivo grade recombinant human IgG1 isotype control antibody in its full length (Syd labs, #PA007125) as the negative control. It is the 4F17 clone, which is widely used and showed low or no specific binding to any human samples [10] (Human IgG1 Isotype Control Antibody | Recombinant, in vivo Grade - Syd Labs). We have added the relevant information in the MATERIALS AND METHODS on page 23 lines 548–549.

      (4) In Figure 2a, all three PBS control mice died, whereas in Figure 2f, three out of five PBS control mice died, with the remaining showing gradual weight recovery. This discrepancy may reflect individual immune variations within the control groups, and it is necessary to clarify whether potential autoimmune factors could have affected the comparability of the results. Also, the mouse experiments suffer from insufficient sample sizes, which affects the statistical power and reliability of the results. In Figure 2a, each group contains only 4 replicates, one of which was used for lung tissue sampling. As a result, body weight monitoring data is derived from only 3 mice per group (the figure legend indicating n=4 should be corrected to n=3). Such a small sample size limits the robustness of the conclusions. Similarly, in Figure 2f, although each group has 5 replicates, body weight data are presented for only 4 mice, with no explanation provided for the exclusion of the fifth mouse. Furthermore, the lung tissue experiments in Figure 3a include only 3 replicates, which is also inadequate.

      We thank the reviewer for his valuable feedback.

      Figure 2a was the first in vivo infection experiment of this study, and we performed the test in aged female K18-hACE2 mice at 10–12 months old. For the subsequent experiments in Figure 2f and Figure 3, we changed to young female K18-hACE2 mice at 2–3 months old because of the limited supply of old mice. While in Figure 2a all four aged mice (not three) in the PBS control group died within 7 dpi, the results of Figure 2f and Figure 3 consistently showed heterogeneous responses among young mice in the PBS control groups. Since increased susceptibility to SARS-CoV-2 infection has been broadly observed among aged human populations and is also supported by a mouse study [11], we attribute the observed discrepancy to the age difference between the two cohorts in Figure 2a and 2f. In the revised manuscript, we have further elucidated this observation in results (on page 7 lines 163–167) and included a new reference for better clarification (page 7 line 167).

      Furthermore, the PBS control mice in both Figure 2a and 2f died within 7 dpi, which was too soon for autoimmune factors to take effect. Moreover, we have performed AAV-based prolonged overexpression experiments in K18-hACE2 mice (new Supplementary Fig. 4), which showed no tissue damage in either WT sACE2-Fc or B5-D3 treated mice, suggesting low immunogenicity. Collectively, autoimmune factors are unlikely to be the reason for the different survival between PBS controls in Figure 2a and 2f.

      We thank the reviewer for pointing out the weakness regarding small sample sizes in our study.

      (1) In Figure 2a–c, the experiment was performed in an aged cohort at 10–12 months old, starting with 5 mice in each virus-inoculated group and 4 mice in the mock control group. At 4 dpi, we sacrificed one mouse from each group for tissue analysis. Therefore, in the survival analysis, there were 4 mice in each virus-inoculated group and 3 mice in the mock control group, whose survival and body weight changes were presented in Figure 2b, c.

      Despite the relatively small sample sizes in Figure 2b, c, all 4 PBS control mice died, while all 4 mice in the 6-hour B5-D3 IN prophylaxis group survived, demonstrating 100% survival and no sign of body weight loss. The survival and body weight data were highly consistent, strongly supporting that B5-D3 intranasal prophylaxis could protect the mice from lethal SARS-CoV-2 infection.

      To enhance clarity, in the revised manuscript, we have added the sample size information in chart legends in Figure 2a–c.

      (2) In Figure 2f–h, the experiment was performed in a young cohort at 2–3 months old, and the body weight and survival data were presented for 5 mice in each group (not 4). Notably, although 2 out of 5 young mice in the PBS control group eventually survived the viral infection, they had suffered significant weight loss during 4–7 dpi, similar to the mice that died. In contrast, all 5 mice in the −6 h B5-D3 IN prophylaxis group showed no sign of weight loss. Hence, these data were highly consistent with Figure 2b, c, supporting the efficacy of B5-D3 IN prophylaxis in protecting against SARS-CoV-2 infection.

      We noticed that some data points in Figure 2g, h were very close to each other, making it difficult to distinguish the data line for individual mice. To enhance clarity, in the revised manuscript, we have added sample-size information in chart legends in Figure 2g and 2h.

      (3) In Figure 3a, we aimed to examine the lung tissues at early time points. For each treatment, 3 mice were sacrificed at each selected time point. Hence, a total of 9 mice were examined in the PBS control group and in the B5-D3 IN group, yielding results at 1 dpi, 2 dpi and 4 dpi that consistently supported each other. Moreover, the viral titers and the S and N expression analyses all showed significant differences among groups. Therefore, the differences between treatment groups in our experiments are sufficient to draw the conclusion.

      (5) Compared to 6 hours, intranasal administration of B5-D3 at 24 hours before viral infection results in reduced protective efficacy. However, only survival and body weight data are provided, with no supporting evidence from virological assays such as viral titer measurement. Therefore, the long-term effectiveness lacks sufficient experimental validation.

      In Figure 2f–h, we aimed to compare the efficacies of IN administration of B5-D3 at different time points, mainly focusing on the body weight change and survival data over the infection and recovery period. As indicated by the early data in Figure 2d, viruses were largely cleared by 4 dpi in mice treated with B5-D3 prophylaxis. Therefore, in this test, we did not examine virus titers in the recovered animals at the end of observation at 14 dpi. Instead, we examined plasma levels of virus-neutralizing antibodies in the survivors at the endpoint, which indeed supported that the 6-hour and 24-hour IN B5-D3 prophylaxis provided effective protection against SARS-CoV-2 infection and resulted in minimal levels of neutralizing antibodies in plasma, as shown in Figure 2i.

      Collectively, the body weight, survival, and antibody data all supported that 6-hour IN B5-D3 prophylaxis achieved the best efficacy. Hence, we performed comprehensive viral titer and profiling analysis at early time points (1 dpi, 2 dpi, and 4 dpi), focusing only on the 6-hour IN B5-D3 prophylaxis. This work also included the B5-D3-LALA control to examine viral titers, host immune responses, and underlying mechanisms (Figures 3, 4).

      We agree with the reviewer that it would be more comprehensive if our experiments could include in-depth analysis of the 24-hour IN B5-D3 prophylaxis group. However, due to the limited capacity of the animal facility, we chose to focus on the best-performing group as a representative treatment to study the underlying mechanisms.

      (6) In Figures 3b and 3c, viral spike (S) and nucleocapsid (N) RNA relative expression levels are quantified by qPCR. The results show significant individual variation within the B5-D3-LALA treatment group: one mouse exhibits high S and N expression, while the other two show low expression. Viral load levels are also inconsistent: two mice have high viral loads, and one has a low viral load. Due to this variability, the available data are insufficient to robustly support the conclusion.

      We understand the reviewer’s concern about the variability within the B5-D3-LALA group. However, we have some reservations about the importance of further increasing the sample sizes in this test.

      First, since viral gene transcription and viral particle levels represent different phases of the viral life cycle, they may follow different kinetics during infection progression and lead to variability. Second, we used different parts of the lung tissues from each mouse for extracting RNA and for preparing tissue homogenates, which were then used for detection of S/N expression and viral load levels, respectively. Uneven viral infection in the lung might also contribute to the variability. Furthermore, in this test, both our qPCR and viral load analysis data consistently demonstrated that B5-D3-LALA was less effective than B5-D3, indicating that Fc function played an important role in supporting full protection by B5-D3 against lethal SARS-CoV-2 infections. This observation is also supported by other studies [12].

      We appreciate the valuable feedback from the reviewer. In the revised manuscript, we have further clarified these observations on page 8, lines 192–194, and included alveolar thickening data on page 9, lines 202–204.

      (7) Figure 3e: "H&E staining indicated alveolar thickening in all groups," including the Mock group. Since the Mock group did not receive virus or active drug treatment, this observed change may result from local tissue reaction induced by the intranasal inoculation procedure itself, rather than specific immune activation. A control group (no manipulation) should be set to rule out potential confounding effects of the experimental procedure on tissue morphology, thereby allowing a more accurate assessment of the drug's effects.

      We thank the reviewer for his insightful comments and suggestions.

      We have further examined our H&E staining and quantified alveolar thickening in the different treatment groups. Indeed, the data suggested a transient alveolar thickening in the mock group at 1 dpi, which had improved by 2 dpi. This observation supports that the intranasal procedure itself caused a transient alveolar thickening that was evident at 1 dpi but had disappeared by 2 dpi.

      Notably, moderate alveolar thickening persisted in the B5-D3-treated mice until the endpoint at 4 dpi. In contrast, the PBS groups with intensive SARS-CoV-2 infection progressively developed severe structural damage and showed much stronger alveolar thickening than the B5-D3 or mock groups at 4 dpi. Consistent with the partial protection by B5-D3-LALA, histological analysis of lung samples in this group revealed more severe yet heterogeneous alveolar thickening. These observations suggested that −6 h IN B5-D3 treatment prevented the tissue damage caused by infection, with minimal yet efficient immune activation.

      In the revised manuscript, we have included the quantitation results of alveolar thickening on page 9, lines 200–204 and presented the data in new Supplementary Fig. 7.

      (8) In Supplementary Figure 11b, a considerable number of alveolar macrophages (AMs) are observed in both the PBS and B5-D3 groups. This makes it difficult to determine whether the observed accumulation is specifically induced by B5-D3.

      We thank the reviewer for pointing out this issue.

      In this experiment, the cell populations examined in the previous Supplementary Fig. 11b and in Fig. 5h are different, although the graphs appear similar.

      Supplementary Fig. 11b (new Supplementary Fig. 12b) showed the analysis among CD45+ immune cells, regardless of B5-D3-AF750 signal. The dominance of AMs among immune cell populations is a normal physiological feature of BALF cells. To make this clear, we have added new data of BALF cells from untreated mice in the revised manuscript and new Supplementary Fig. 12b.

      Fig. 5h displays the cell type analysis among the CD45+ B5-D3-AF750+ cells, that is, only the CD45+ immune cells that took up the AF750-labeled B5-D3.

      To enhance clarity, in the revised manuscript, we have amended the labels as CD45+ B5-D3-AF750+ in Figure 5h (and similarly in revised Supplementary Fig. 13), to differentiate the data from that in CD45+ cells shown in the revised Supplementary Fig. 12b.

      (9) In the flow cytometry experiment shown in Figure 5, the PBS control group is not labeled with AF750, which necessarily results in a value of zero for "B5-D3+ cells" on the y-axis. An appropriate control (e.g., hIgG1-Fc labeled with AF750) should be included.

      We thank the reviewer for his valuable question.

      In this experiment, we intended to analyze all immune cells with positive AF750 signals, to identify the major immune cell types that took up AF750-B5-D3 as the candidate cells responsible for the observed activation of innate immunity. Hence, here we deliberately set PBS vehicle treatment without AF750 signal as the control group for gating.

      This analysis aimed to provide an overall picture of the immune cell types that actively take up the ACE2 decoy, likely via Fc receptor-mediated binding. Control IgG1 labeled with AF750, which has an Fc region, may show a similar profile and biodistribution among BALF immune cells and was therefore not examined as a control for gating.

      Instead, in the revised manuscript, we have added new analysis results comparing the efficiencies of B5-D3 and IgG1 in mediating pseudovirus uptake in THP-1-derived macrophages. The IgG1 isotype control was examined to address the ACE2-specific effect. Indeed, we observed no pseudovirus uptake, based on p24 signal, in the IgG1-treated samples, indicating that the presence of B5-D3 is crucial for efficient pseudovirus uptake in macrophages due to the sACE2-spike affinity. These results have been added on page 13 lines 310–316 in the main text, and the relevant data are presented in new Supplementary Fig. 17.

      (10) The Methods section: a more detailed description of the experimental procedures involving HIV p24 and SARS-CoV-2 should be included.

      We thank the reviewer for pointing out this weakness.

      In the revised manuscript, we have provided further details of the relevant experimental procedures in the Materials and Methods part, on page 21, lines 507–517.

      Reviewer #3 (Public review):

      Strengths:

      The core strength of this study lies in its innovative demonstration that an engineered sACE2-Fc fusion redirects virus-decoy complexes to Fc-mediated phagocytosis and lysosomal clearance in macrophages, revealing a distinct antiviral mechanism beyond traditional neutralization. Its complete prophylactic protection in animal models and precise targeting of airway phagocytes establish a novel therapeutic paradigm against SARS-CoV-2 variants and future respiratory viruses.

      We thank the reviewer for his recognition and positive comments on our study.

      Weaknesses:

      The study attributes complete antiviral protection to Fc-mediated phagocytic clearance, a central claim that requires more rigorous experimental validation. The observation that abrogating Fc functions compromises protection could be confounded by potential alterations in the protein's stability, half-life, or overall structure. To firmly establish this mechanism, it is crucial to include a control molecule with a mutated Fc region that lacks FcγR binding while preserving the Fc structure itself. Without this critical control, the conclusion that phagocytic clearance is the primary mechanism remains inadequately supported.

      We thank the reviewer for his insightful comments and suggestions.

      The L234A/L235A mutations in human IgG1 Fc region are most widely used to abolish its FcγR binding and Fc effector functions [13]. In this study, we have used B5-D3-LALA in the in vivo infection experiments in K18-hACE2 mice, as the control molecule that lacks FcγR binding while preserving the Fc structure (Figure 3, 4).

      To address the reviewer’s concern, we further performed a new analysis comparing the efficiencies of different versions of B5-D3 in mediating pseudovirus uptake in THP-1-derived macrophages. In this test, B5-D3-LALA and B5-D3 were examined side by side to address the role of Fc effector functions in the phagocytosis process. Meanwhile, the IgG1 isotype control was examined to address the ACE2-specific effect. Indeed, we detected a significant reduction of pseudovirus uptake, based on p24 signal, in the B5-D3-LALA-treated samples compared to those receiving B5-D3. This decreased pseudoviral uptake correlated with the loss of Fc-mediated effector functions in B5-D3-LALA, indicating the involvement of Fc functions in efficient macrophage uptake of the B5-D3-virus complex.

      In the revised manuscript, we have included these results on page 13 lines 310–316 in the main text and presented relevant data in Supplementary Fig. 17.

      The strategy of deliberately targeting virus-decoy complexes to phagocytes via Fc receptors inherently raises the question of Antibody-Dependent Enhancement (ADE) of disease. While the authors demonstrate a lack of productive infection in macrophages, this only addresses one facet of ADE. The risk of Fc-mediated exacerbation of inflammation (ADE) remains a critical concern. The manuscript would be significantly strengthened by a direct discussion of this risk and by including data, such as cytokine profiling from treated macrophages, to more comprehensively address the safety profile of this approach.

      (1) We thank the reviewer for his insightful comments and suggestions regarding the ADE issue.

      Indeed, Antibody-Dependent Enhancement (ADE) of viral infection is a critical concern when developing the ACE2 decoy strategy. In this study, we have carefully examined the relevant risk based on our data from various in vitro and in vivo assays.

      In our in vivo infection experiments, all B5-D3 prophylaxis and treatment groups, regardless of the administration times and routes, showed improved outcomes like less body-weight loss and better survival, compared to the PBS control groups (Figure 2). None of these treatment groups demonstrated worsened infections, indicating that ADE phenomenon was not occurring or did not play a major role during the B5-D3 treatments. Instead, moderate immune activation was observed in the lung of B5-D3 treated mice, which occurred much earlier but was milder compared to that in the PBS groups, and may reflect responses that lead to the efficient early clearance of viruses without observable symptoms (Figure 3 and 4).

      In our in vitro assays shown in Figure 6, B5-D3 treatments in epithelial or non-immune cell models (hACE2-Calu-3 and hACE2-293T) significantly blocked the entry of pseudovirus into cells and yielded much reduced luciferase signals (Figure 6d–g). In THP-1-derived macrophages, by contrast, although the presence of B5-D3 largely enhanced the entry of SARS-CoV-2 pseudovirus into cells (Figure 6a,b), it did not result in active infection and produced no luciferase signal (Figure 6g). These results were robustly reproducible, indicating that the pseudoviruses did not successfully release their genomic RNA and viral proteins (like RTase and integrases) after entering macrophages. Instead, colocalization analysis of p24 (pseudoviruses), sACE2-Fc (B5-D3), and LAMP1 (lysosome) signals suggested that the pseudoviruses were probably degraded in endosomes/lysosomes after cell entry (Figure 6a,c). Consistently, examination of the macrophages that had taken up pseudovirus showed that the Spike (S) proteins from the pseudovirus particles were not cleaved to release the S2’ fragment at a distinct smaller size (Figure 6h). Because cleavage of the S protein in host cells is critical for effective membrane fusion, it is regarded as a hallmark of successful viral entry and escape from the endosome. Collectively, these data consistently indicated that the SARS-CoV-2 pseudoviruses were degraded directly in lysosomes after entering macrophages, showing no sign of ADE.

      (2) We thank the reviewer for his valuable suggestion and have performed RNA-seq analysis to profile immune responses in the treated macrophages.

      We performed RNA-seq analysis to investigate major transcriptional changes in THP-1-derived macrophages after the pseudovirus infection, with or without B5-D3 treatments. Although no individual genes fulfilled the cutoff threshold of significant up-/down-regulation, we observed antiviral responses in the pseudovirus-B5-D3-treated samples by GSEA (new Supplementary Fig. 18). This observation indicated that the B5-D3 treatment and the subsequent entry of pseudovirus-B5-D3 complexes into macrophages induced immune activation at moderate levels, without evoking strong immune responses that could be harmful to the host.

      In the revised manuscript, we have included the new RNA-seq analysis results on macrophage infection tests on page 13 lines 317–322 and page 14 lines 323–325 in the main text and presented the relevant data in the new Supplementary Fig. 18. Furthermore, we agree that ADE is a critical issue and have further enriched our discussion on page 17 lines 415–417, to emphasize that the risk for ADE should be thoroughly evaluated to further develop the decoy strategy for human use.

      The exclusive use of the K18-hACE2 mouse model, which exhibits severe disease, limits the generalizability of the findings. The "complete protection" observed may not translate to models with more robust and naturalistic immune responses or to human physiology.

      We thank the reviewer for pointing out the limitation of the mouse model used.

      (1) Given that wild type mice are not susceptible to SARS and SARS-CoV-2 infection, transgenic mice have been generated to express hACE2, through various designs and strategies, serving as models for viral infection and drug development. However, many of these hACE2 transgenic mouse models exhibit mild infections due to moderate hACE2 levels, failing to develop the severity observed in SARS and COVID patients [14].

      (2) The K18-hACE2 transgenic mouse line (B6.Cg-Tg(K18-ACE2)2Prlmn/J, Jackson Laboratory) used in our study carries multiple copies of the K18-hACE2 transgene cassette [15]. Compared to other hACE2 transgenic mouse models, this K18-hACE2 line shows higher expression of hACE2 in airway and other epithelia and supports more severe infections by both SARS and SARS-CoV-2 viruses, successfully causing lethality [16]. Hence, K18-hACE2 mice are a widely used model to study SARS and SARS-CoV-2 virus infections and drug development.

      (3) We agree that K18-hACE2 mice are a relatively weak transgenic line with poor productivity. However, this line demonstrates the best susceptibility to SARS-CoV-2 infection among established mouse models. In this study, we observed robust responses to SARS-CoV-2 infection in both aged and young cohorts, with all infected mice consistently demonstrating significant body weight loss from 4 dpi to 7 dpi (the PBS groups in Figure 2b, g).

      We agree with the reviewer that it would be more convincing to assess the efficacy of B5-D3 using additional animal models. However, we have some reservations about the importance of these additional tests. First, the generality of the ACE2-Fc decoy concept and its efficacy have been reported in other studies using various models [17,18]. Moreover, different transgenic mice or animal models exhibit distinct kinetics in the pathogenesis process and immune responses to SARS-CoV-2 infections, which differ from those in human patients in various aspects. Hence, given the limited capacity of the animal facility, we chose to focus on the K18-hACE2 mice, which have demonstrated the most robust and convincing infection data, to investigate the potential of B5-D3 administered through various strategies as well as the underlying mechanisms for the full protection observed in IN prophylaxis.

      In the revised manuscript, we have further enriched our discussion regarding this limitation, on page 17 lines 417–422.

      Furthermore, the lack of data on circulating SARS-CoV-2 variants is a concern.

      We thank the reviewer for his valuable comment.

      In this study, we have demonstrated the viral neutralization capacity of B5-D3, as a representative of the enhanced sACE2 decoy, using multiple pseudoviruses and authentic SARS-CoV-2, which collectively covered eleven variants (up to Omicron strains). Our results from both in vitro neutralization and PRNT experiments confirmed the robust resilience of B5-D3 against viral evolution (Figure 1c–g). This observation aligns well with other studies and is broadly supported by various investigations, as was pointed out below by the reviewer.

      Furthermore, studies on viral evolution have observed a robust trend that later-emerging SARS-CoV-2 variants exhibit a higher affinity for the ACE2 receptor, enhancing their infectivity and transmissibility [19]. Therefore, it is unlikely that a newly emerged SARS-CoV-2 variant would escape B5-D3-mediated neutralization.

      Collectively, all evidence consistently supports the principle of decoy design: B5-D3 (or another effective ACE2 decoy) possesses the intrinsic ability to neutralize newly circulating SARS-CoV-2 variants, as long as the variants rely on the ACE2 receptor for cell entry. Hence, although further tests on circulating viral variants would strengthen our study, the significance of these additional data may be limited.

      In the revised manuscript, we have further addressed this concern in the discussion, on page 16 lines 394–397.

      The concept of sACE2-Fc fusion proteins as decoy receptors is not novel, and numerous similar constructs have been previously reported. The manuscript would benefit from a clearer demonstration of how the optimized B5-D3 mutant represents a significant advance over existing sACE2-Fc designs.

      We thank the reviewer for his valuable comments.

      Indeed, previous research has reported multiple ACE2 mutations to enhance its binding to spike proteins and neutralization against SARS-CoV-2. However, combining ACE2 mutations based on in silico predictions to both enhance spike binding and eliminate the ACE2 enzymatic activity resulted in accumulated burdens. For instance, ACE2 decoy candidates with up to five mutations like K31F/N33D/H34S/E35Q/H345L [8] and L79F/M82Y/Q325Y/H374A/H378A [12] have demonstrated excellent potency to neutralize SARS-CoV-2 in both in vitro and in vivo assays. However, the extensive mutations could be associated with structural instability and reduced production efficiency [8,12]. Furthermore, the high mutation loads increase risks for immunogenicity, which is a critical issue in future clinical applications. Corroboratively, Urano et al. detected in vitro T cell stimulation elicited by the L79F mutation, whereas the T92Q mutation (included in our decoy design) showed much lower immunogenicity and enhanced spike binding affinity [20].

      In our ACE2 decoy design, we incorporated only two mutations (such as T92Q and H374N in B5-D3) to enhance neutralization potency while eliminating enzymatic activity, yielding the simplest ACE2 mutants suitable for engineering an enhanced decoy. B5-D3, as one representative, exhibited not only minimal mutation-related risks (Supplementary Fig. 2i) but also top-level neutralization potencies among all candidate mutants tested (Figure 1, Supplementary Fig. 2f,g and Supplementary Fig. 3). To further address the safety of B5-D3 for in vivo use, we have performed prolonged in vivo overexpression of the B5-D3 ACE2 decoy through AAV delivery in immune-competent K18-hACE2 mice, which indeed showed no sign of RAS disturbance or immune infiltration causing tissue damage. (In the revised manuscript, we have included these new results on page 5 lines 118–122 and page 6 lines 123–135 in the main text and presented the data in new Supplementary Fig. 4.)

      Therefore, instead of demonstrating an advantage over existing sACE2-Fc designs, our study used the optimized B5-D3 as a representative ACE2 decoy among the top performers to systematically examine various administration strategies as well as the underlying mechanisms for the full protection observed in IN prophylaxis. Aligned with this effort, our study identified 6-hour IN prophylaxis as the most effective regimen to confer complete protection against SARS-CoV-2 infection in K18-hACE2 mice. Further investigation through transcriptomics, biodistribution, and phagocytosis analysis revealed that IN-delivered B5-D3 not only neutralized viruses but also engaged airway phagocytes to promote early viral clearance and host immune activation, uncovering a distinct antiviral mechanism for the universal “decoy strategy” to combat unknown airborne respiratory viruses in the future.

      In the revised manuscript, we have further clarified our focus on using B5-D3 as a “representative” of ACE2 decoy on page 4 line 84, page 5 line 109, page 14 line 333, and page 15 line 360.

      A direct comparative analysis with previously published benchmarks, particularly in terms of neutralizing potency, Fc effector function strength, and in vivo efficacy, is necessary to establish the incremental value and novelty of this specific agent.

      We thank the reviewer for his valuable comments.

      Indeed, our study has aimed to address this concern and made partial progress through in vitro neutralization assays (Figure 1b and Supplementary Fig. 2c,d,f,g). Our results from the limited yet meaningful comparisons with the sACE2 lacking an Fc domain and with selected sACE2-Fc mutants published or proposed previously clearly demonstrated “substantial enhancement through Fc-fusion” (Supplementary Fig. 1d) and “modest improvement from protein mutagenesis at the ACE2-Spike interaction interface” (Figure 1b and Supplementary Fig. 2c,d,f,g).

      Based on the results from our various neutralization assays, we chose B5-D3 as a representative enhanced decoy for in vivo infection experiments, which identified 6-hour IN prophylaxis as conferring complete protection against infection and demonstrated the significant impact of administration strategy on the in vivo efficacy of B5-D3 (Figure 2). Subsequent analysis further uncovered intriguing phenomena regarding the cellular distribution of IN-administered B5-D3 and the early immune activation triggered in the lung, which underlie the full protection by IN prophylaxis and represent an important novelty of this study.

      We agree with the reviewer that further analysis with additional benchmark versions would enhance the value of this study, but we have reservations regarding its importance. To enhance clarity, in the revised manuscript, we have further emphasized our focus on using B5-D3 as a representative ACE2 decoy throughout the text and enriched the discussion on page 15 lines 348–365.

      References

      (1) Ku Z, Xie X, Hinton PR, Liu X, Ye X, Muruato AE, Ng DC, Biswas S, Zou J, Liu Y, Pandya D, Menachery VD, Rahman S, Cao Y-A, Deng H, Xiong W, Carlin KB, Liu J, Su H, Haanes EJ, Keyt BA, Zhang N, Carroll SF, Shi P-Y & An Z. Nasal delivery of an IgM offers broad protection from SARS-CoV-2 variants. Nature 595, 718-723 (2021).

      (2) Liu J, Mao F, Chen J, Lu S, Qi Y, Sun Y, Fang L, Yeung ML, Liu C, Yu G, Li G, Liu X, Yao Y, Huang P, Hao D, Liu Z, Ding Y, Liu H, Yang F, Chen P, Sa R, Sheng Y, Tian X, Peng R, Li X, Luo J, Cheng Y, Zheng Y, Lin Y, Song R, Jin R, Huang B, Choe H, Farzan M, Yuen KY, Tan W, Peng X, Sui J & Li W. An IgM-like inhalable ACE2 fusion protein broadly neutralizes SARS-CoV-2 variants. Nat Commun 14, 5191 (2023).

      (3) Guo H, Cho B, Hinton PR, He S, Yu Y, Ramesh AK, Sivaccumar JP, Ku Z, Campo K, Holland S, Sachdeva S, Mensch C, Dawod M, Whitaker A, Eisenhauer P, Falcone A, Honce R, Botten JW, Carroll SF, Keyt BA, Womack AW, Strohl WR, Xu K, Zhang N, An Z, Ha S, Shiver JW & Fu T-M. An ACE2 decamer viral trap as a durable intervention solution for current and future SARS-CoV. Emerging Microbes & Infections 12, 2275598 (2023).

      (4) Keyt BA, Baliga R, Sinclair AM, Carroll SF & Peterson MS. Structure, Function, and Therapeutic Use of IgM Antibodies. Antibodies 9, 53 (2020).

      (5) Chan KK, Dorosky D, Sharma P, Abbasi SA, Dye JM, Kranz DM, Herbert AS & Procko E. Engineering human ACE2 to optimize binding to the spike protein of SARS coronavirus 2. Science 369, 1261-1265 (2020).

      (6) Guy JL, Jackson RM, Jensen HA, Hooper NM & Turner AJ. Identification of critical active-site residues in angiotensin-converting enzyme-2 (ACE2) by site-directed mutagenesis. The FEBS Journal 272, 3512-3520 (2005).

      (7) Payandeh Z, Rahbar MR, Jahangiri A, Hashemi ZS, Zakeri A, Jafarisani M, Rasaee MJ & Khalili S. Design of an engineered ACE2 as a novel therapeutics against COVID-19. Journal of Theoretical Biology 505, 110425 (2020).

      (8) Glasgow A, Glasgow J, Limonta D, Solomon P, Lui I, Zhang Y, Nix MA, Rettko NJ, Zha S, Yamin R, Kao K, Rosenberg OS, Ravetch JV, Wiita AP, Leung KK, Lim SA, Zhou XX, Hobman TC, Kortemme T & Wells JA. Engineered ACE2 receptor traps potently neutralize SARS-CoV-2. Proceedings of the National Academy of Sciences 117, 28046-28055 (2020).

      (9) Lei C, Qian K, Li T, Zhang S, Fu W, Ding M & Hu S. Neutralization of SARS-CoV-2 spike pseudotyped virus by recombinant ACE2-Ig. Nature Communications 11, 2070 (2020).

      (10) Maciuba S, Bowden GD, Stratton HJ, Wisniewski K, Schteingart CD, Almagro JC, Valadon P, Lowitz J, Glaser SM, Lee G, Dolatyari M, Navratilova E, Porreca F & Riviere PJM. Discovery and characterization of prolactin neutralizing monoclonal antibodies for the treatment of female-prevalent pain disorders. MAbs 15, 2254676 (2023).

      (11) Dwivedi V, Shivanna V, Gautam S, Delgado J, Hicks A, Argonza M, Meredith R, Turner J, Martinez-Sobrido L, Torrelles JB & Kulkarni V. Age associated susceptibility to SARS-CoV-2 infection in the K18-hACE2 transgenic mouse model. Geroscience 46, 2901-2913 (2024).

      (12) Chen Y, Sun L, Ullah I, Beaudoin-Bussières G, Anand SP, Hederman AP, Tolbert WD, Sherburn R, Nguyen DN, Marchitto L, Ding S, Wu D, Luo Y, Gottumukkala S, Moran S, Kumar P, Piszczek G, Mothes W, Ackerman ME, Finzi A, Uchil PD, Gonzalez FJ & Pazgier M. Engineered ACE2-Fc counters murine lethal SARS-CoV-2 infection through direct neutralization and Fc-effector activities. Science Advances 8, eabn4188 (2022).

      (13) Lund J, Winter G, Jones PT, Pound JD, Tanaka T, Walker MR, Artymiuk PJ, Arata Y, Burton DR, Jefferis R & Woof JM. Human Fc gamma RI and Fc gamma RII interact with distinct but overlapping sites on human IgG. The Journal of Immunology 147, 2657-2662 (1991).

      (14) Lutz C, Maher L, Lee C & Kang W. COVID-19 preclinical models: human angiotensin-converting enzyme 2 transgenic mice. Hum Genomics 14, 20 (2020).

      (15) McCray PB, Pewe L, Wohlford-Lenane C, Hickey M, Manzel L, Shi L, Netland J, Jia HP, Halabi C, Sigmund CD, Meyerholz DK, Kirby P, Look DC & Perlman S. Lethal Infection of K18-hACE2 Mice Infected with Severe Acute Respiratory Syndrome Coronavirus. Journal of Virology 81, 813-821 (2007).

      (16) Oladunni FS, Park JG, Pino PA, Gonzalez O, Akhter A, Allue-Guardia A, Olmo-Fontanez A, Gautam S, Garcia-Vilanova A, Ye C, Chiem K, Headley C, Dwivedi V, Parodi LM, Alfson KJ, Staples HM, Schami A, Garcia JI, Whigham A, Platt RN, 2nd, Gazi M, Martinez J, Chuba C, Earley S, Rodriguez OH, Mdaki SD, Kavelish KN, Escalona R, Hallam CRA, Christie C, Patterson JL, Anderson TJC, Carrion R, Jr., Dick EJ, Jr., Hall-Ursone S, Schlesinger LS, Alvarez X, Kaushal D, Giavedoni LD, Turner J, Martinez-Sobrido L & Torrelles JB. Lethality of SARS-CoV-2 infection in K18 human angiotensin-converting enzyme 2 transgenic mice. Nat Commun 11, 6122 (2020).

      (17) Urano E, Itoh Y, Suzuki T, Sasaki T, Kishikawa JI, Akamatsu K, Higuchi Y, Sakai Y, Okamura T, Mitoma S, Sugihara F, Takada A, Kimura M, Nakao S, Hirose M, Sasaki T, Koketsu R, Tsuji S, Yanagida S, Shioda T, Hara E, Matoba S, Matsuura Y, Kanda Y, Arase H, Okada M, Takagi J, Kato T, Hoshino A, Yasutomi Y, Saito A & Okamoto T. An inhaled ACE2 decoy confers protection against SARS-CoV-2 infection in preclinical models. Sci Transl Med 15, eadi2623 (2023).

      (18) Higuchi Y, Suzuki T, Arimori T, Ikemura N, Mihara E, Kirita Y, Ohgitani E, Mazda O, Motooka D, Nakamura S, Sakai Y, Itoh Y, Sugihara F, Matsuura Y, Matoba S, Okamoto T, Takagi J & Hoshino A. Engineered ACE2 receptor therapy overcomes mutational escape of SARS-CoV-2. Nature Communications 12, 3802 (2021).

      (19) Cho MJ, Been NR & Son H. From Alpha to Omicron: Structural Insights into SARS-CoV-2 RBD Evolution and ACE2 Binding. European Journal of Public Health 35 (2025).

      (20) Urano E, Itoh Y, Suzuki T, Sasaki T, Kishikawa J-i, Akamatsu K, Higuchi Y, Sakai Y, Okamura T, Mitoma S, Sugihara F, Takada A, Kimura M, Nakao S, Hirose M, Sasaki T, Koketsu R, Tsuji S, Yanagida S, Shioda T, Hara E, Matoba S, Matsuura Y, Kanda Y, Arase H, Okada M, Takagi J, Kato T, Hoshino A, Yasutomi Y, Saito A & Okamoto T. An inhaled ACE2 decoy confers protection against SARS-CoV-2 infection in preclinical models. Science Translational Medicine 15, eadi2623 (2023).

    1. eLife Assessment

      This study integrates large-scale behavioral, genetic, and molecular analyses in animal models to investigate morphine response. Utilizing high-quality, time-series Quantitative Trait Loci (QTL) mapping, the work provides compelling evidential support for novel, time-dependent genetic interactions (epistasis). A fundamental result of this rigorous analysis is the discovery of a novel Oprm1-Fgf12-MAPK signaling pathway, which offers new insights into the mechanisms of opioid sensitivity.

    2. Reviewer #1 (Public review):

      [Editors' note: this version has been assessed by the Reviewing Editor without further input from the original reviewers. The authors have appropriately addressed the comments raised in the previous round of review.]

      Summary:

      The study by Lemen et al. represents a comprehensive and unique analysis of gene networks in rat models of opioid use disorder, using multiple strains and both sexes. It provides a time-series analysis of Quantitative Trait Loci (QTLs) in response to morphine exposure.

      Strengths:

      A key finding is the identification of a previously unknown morphine-sensitive pathway involving Oprm1 and Fgf12, which activates a cascade through MAPK kinases in D1 medium spiny neurons (MSNs). Strengths include the large-scale, multi-strain, sex-inclusive design; the time-series QTL mapping, which provides dynamic insights; and the discovery of an Oprm1-Fgf12-MAPK signaling pathway in D1 MSNs, which is novel and relevant.

    3. Reviewer #2 (Public review):

      Summary:

      This highly novel and significant manuscript re-analyzes behavioral QTL data derived from morphine locomotor activity in the BXD recombinant inbred panel. The combination of interacting behavioral-pharmacology (morphine and naltrexone) time course data, high-resolution mouse genetic analyses, genetic analysis of gene expression (eQTLs), cross-species analysis with human gene expression and genetic data, and molecular modeling approaches with Bayesian network analysis produces new information on loci modulating morphine locomotor activity.

      Furthermore, the identification of time-wise epistatic interactions between the Oprm1 and Fgf12 loci is highly novel and points to methodological approaches for identifying other epistatic interactions using animal model genetic studies.

      Strengths:

      (1) Use of state-of-the-art genetic tools for mapping behavioral phenotypes in mouse models.

      (2) Adequately powered analysis incorporating both sexes and time course analyses.

      (3) Detection of time and sex-dependent interactions of two QTL loci modulating morphine locomotor activity.

      (4) Identification of putative candidate genes by combined expression and behavioral genetic analyses.

      (5) Use of Bayesian analysis to model causal interactions between multiple genes and behavioral time points.

      Appraisal:

      The authors largely succeeded in reaching goals with novel findings and methodology.

      Significance of Findings:

      This study will likely spur future direct experimental studies to test hypotheses generated by this complex analysis. Additionally, the broad methodological approach incorporating time course genetic analyses may encourage other studies to identify epistatic interactions in mouse genetic studies.

    4. Reviewer #3 (Public review):

      Summary:

      This is a clearly written paper that describes the reanalysis of data from a BXD study of the locomotor response to morphine and naloxone. The authors detect significant loci and an epistatic interaction between two of those loci. Single-cell data from outbred rats is used to investigate the interaction. The authors also use network methods and incorporate human data into their analysis.

      Strengths:

      One major strength of this work is the use of granular time-series data, enabling the identification of time-point-specific QTL. This allowed for the identification of an additional, distinct QTL (the Fgf12 locus) in this work compared to previously published analysis of these data, as well as the identification of an epistatic effect between Oprm1 (driving early stages of locomotor activation) and Fgf12 (driving later stages).

    5. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      The study by Lemen et al. represents a comprehensive and unique analysis of gene networks in rat models of opioid use disorder, using multiple strains and both sexes. It provides a time-series analysis of Quantitative Trait Loci (QTLs) in response to morphine exposure.

      Strengths:

      A key finding is the identification of a previously unknown morphine-sensitive pathway involving Oprm1 and Fgf12, which activates a cascade through MAPK kinases in D1 medium spiny neurons (MSNs). Strengths include the large-scale, multi-strain, sex-inclusive design; the time-series QTL mapping, which provides dynamic insights; and the discovery of an Oprm1-Fgf12-MAPK signaling pathway in D1 MSNs, which is novel and relevant.

      Weaknesses:

      (1) The proposed involvement of Nav1.2 (SCN2A) as a downstream target of the Oprm1-Fgf12 pathway requires further analysis/evidence. Is Nav1.2 (SCN2A) expressed in D1 neurons?

      The authors mentioned that SCN8A (Nav1.6) was tested as a candidate mediator of Oprm1-Fgf12 loci and variation in locomotor activity. However, the proposed model supports SCN2A as a target rather than SCN8A. This is somewhat unexpected since SCN8A is highly abundant in MSN.

      Can the authors provide expression data for SCN2A, Oprm1, and Fgf12 in D1 vs. D2 MSNs?

      Author response image 1.

      We generated Author response image 1 to show that both Scn2a and Scn8a are ubiquitously expressed in MSNs and GABAergic neurons.

      (2) The authors should consider adding a reference to FGF12 in Schizophrenia (PMC8027596) in the Introduction.

      This is a relevant reference. We have cited it in the Discussion rather than the Introduction because we felt it is more relevant there.

      (3) There is recent evidence supporting the druggability of other intracellular FGFs, such as FGF14 (PMC11696184) and FGF13 (PMC12259270), through their interactions with Nav channels. What are the implications of these findings for drug discovery in the context of the present study? Could FGF12 be considered a potential druggable therapeutic target for opioid use disorder (OUD)?

      The recent success in targeting FGF14 and FGF13 protein-protein interactions with sodium channels suggests that FGF12 could indeed be a druggable target for OUD. We have added a section to the Discussion exploring the potential for developing small-molecule modulators of the FGF12-Nav interface as a novel therapeutic strategy.

      Reviewer #2 (Public review):

      Summary:

      This highly novel and significant manuscript re-analyzes behavioral QTL data derived from morphine locomotor activity in the BXD recombinant inbred panel. The combination of interacting behavioral-pharmacology (morphine and naltrexone) time course data, high-resolution mouse genetic analyses, genetic analysis of gene expression (eQTLs), cross-species analysis with human gene expression and genetic data, and molecular modeling approaches with Bayesian network analysis produces new information on loci modulating morphine locomotor activity.

      Furthermore, the identification of time-wise epistatic interactions between the Oprm1 and Fgf12 loci is highly novel and points to methodological approaches for identifying other epistatic interactions using animal model genetic studies.

      Strengths:

      (1) Use of state-of-the-art genetic tools for mapping behavioral phenotypes in mouse models.

      (2) Adequately powered analysis incorporating both sexes and time course analyses.

      (3) Detection of time and sex-dependent interactions of two QTL loci modulating morphine locomotor activity.

      (4) Identification of putative candidate genes by combined expression and behavioral genetic analyses.

      (5) Use of Bayesian analysis to model causal interactions between multiple genes and behavioral time points.

      Weaknesses:

      (1) There is a need for careful editing of the text and figures to eliminate multiple typographical and other compositional errors.

      We have performed a thorough review of the manuscript and corrected typographical errors, including "ddactivates" and other compositional issues.

      (2) There are multiple examples of overstating the possible significance of results that should be corrected or at least directly pointed out as weaknesses in the Discussion. These include:

      (a) Assumption that the Oprm1 gene is the causal candidate gene for the major morphine locomotor Chr10 QTL at the early time epochs. Oprm1 is 400,000 bp away from the support interval of the Mor10a QTL locus, and there is no mention as to whether the Oprm1 mRNA eQTL overlaps with Mor10a.

      We have clarified this in the text. While Oprm1 is located proximal to the peak, its massive size and the presence of a strong mRNA cis-eQTL in the NAc and hippocampus that precisely overlaps with the Mor10a QTL support interval provide robust evidence for its candidacy. We have added this detail to the Results section.

      (b) Although the Bayesian analysis of possible complex interactions between Oprm1, Fgf12, other interacting genes, and behaviors is very innovative and produces testable hypotheses, a more straightforward mediation analysis of causal relationships between genotype, gene expression, and phenotype would have added strength to the arguments for the causal role of these individual genes.

      We agree that mediation analysis would be a valuable addition. We revised the Results section to acknowledge that while the Bayesian network provides a comprehensive causal hypothesis, future studies employing formal mediation analysis could further strengthen these individual gene-to-behavior links.

      (c) The GWAS data analysis for Oprm1 and Fgf12 is incomplete in not mentioning actual significance levels for Oprm1 and perhaps overstating the nominal significance findings for Fgf12.

      We have updated the manuscript to include the specific significance levels for the human GWAS findings related to Oprm1 and Fgf12. We have clarified that the OPRM1 variant rs1799971 reached genome-wide significance (OR = 1.046, p = 4.92 × 10⁻⁹). Furthermore, we have ensured that the findings for FGF12 are described as nominally significant to avoid any overstatement of the results. For example, we now specify that the top FGF12 SNP rs1553460 achieved nominal significance (OR = 1.015, p = 0.021). The Results and Discussion sections have been revised to reflect these precise statistical values.

      Appraisal:

      The authors largely succeeded in reaching goals with novel findings and methodology.

      Significance of Findings:

      This study will likely spur future direct experimental studies to test hypotheses generated by this complex analysis. Additionally, the broad methodological approach incorporating time course genetic analyses may encourage other studies to identify epistatic interactions in mouse genetic studies.

      Reviewer #3 (Public review):

      Summary:

      This is a clearly written paper that describes the reanalysis of data from a BXD study of the locomotor response to morphine and naloxone. The authors detect significant loci and an epistatic interaction between two of those loci. Single-cell data from outbred rats is used to investigate the interaction. The authors also use network methods and incorporate human data into their analysis.

      Strengths:

      One major strength of this work is the use of granular time-series data, enabling the identification of time-point-specific QTL. This allowed for the identification of an additional, distinct QTL (the Fgf12 locus) in this work compared to previously published analysis of these data, as well as the identification of an epistatic effect between Oprm1 (driving early stages of locomotor activation) and Fgf12 (driving later stages).

      Weaknesses:

      (1) What criteria were used to determine whether the epistatic interaction was significant? How many possible interactions were explored?

      By design, we tested for epistasis only between the Oprm1 and Fgf12 loci: a single test of a non-linear interaction. As such, there is no correction for multiple tests and no need for permutation. In other words, the “nominal” P value in this case is the only relevant P value. We have added this clarification in the Results and Methods.
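
      As an illustration only (this is not the authors' code), a single pre-specified interaction test between two biallelic loci can be sketched as a 2 x 2 contrast of genotype-class means. The function name, the 0/1 genotype coding, and the toy numbers are assumptions made for this example; a p-value would follow from a t distribution with the returned degrees of freedom.

```python
from statistics import mean


def interaction_t(pheno, g1, g2):
    """Return (t, df) for the g1 x g2 interaction in a 2 x 2 design.

    pheno: one phenotype value per strain.
    g1, g2: genotypes at the two loci, coded 0 or 1 (hypothetical coding).
    """
    cells = {(a, b): [] for a in (0, 1) for b in (0, 1)}
    for y, a, b in zip(pheno, g1, g2):
        cells[(a, b)].append(y)
    if any(len(values) == 0 for values in cells.values()):
        raise ValueError("each two-locus genotype class needs at least one strain")

    means = {key: mean(values) for key, values in cells.items()}
    # The interaction contrast equals the g1*g2 coefficient of the full model.
    contrast = means[(1, 1)] - means[(1, 0)] - means[(0, 1)] + means[(0, 0)]

    # Residual variance from the within-class scatter (4 fitted means).
    rss = sum((y - means[key]) ** 2 for key, values in cells.items() for y in values)
    df = len(pheno) - 4
    if df <= 0 or rss == 0:
        raise ValueError("need residual degrees of freedom and non-zero residual variance")
    se = ((rss / df) * sum(1 / len(values) for values in cells.values())) ** 0.5
    return contrast / se, df


# Toy data: the double-1 class is far higher than additive effects predict.
g1 = [0, 0, 0, 0, 1, 1, 1, 1]
g2 = [0, 0, 1, 1, 0, 0, 1, 1]
pheno = [1.0, 1.2, 2.0, 2.2, 3.0, 3.2, 8.0, 8.2]
t_stat, dof = interaction_t(pheno, g1, g2)
```

      Because only this one contrast is examined, its unadjusted p-value is the relevant one, which is the point made in the response above.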

      (2) Results are presented for males and females separately, but the decision to examine the two sexes separately was never explained or justified. Since it is not standard to perform GWAS broken down by sex, some initial explanation of this decision is needed. Perhaps the discussion could also discuss what (if anything) was learned as a result of the sex-specific analysis. In the end, was it useful?

      We chose to analyze the sexes both separately and jointly because of significant sex differences and sex-by-strain interactions in the locomotion data. This rationale has been added to the Results section. We also discuss sex-specific results in the revision.

      (3) The confidence intervals for the results were not well described, although I do see them in one of the tables. The authors used a 1.5 support interval, but didn't offer any justification for this decision. Is that a 95% confidence interval? If not, should more consideration have been given to genes outside that interval? For some of the QTLs that are not the focus of this paper, the confidence intervals were very large (>10 Mb). Is that typical for BXDs?

      The 1.5 LOD support interval is a standard metric for most QTL mapping studies, and does correspond approximately to a 95% confidence or support interval. Large intervals are common in BXD studies when effect sizes are moderate or recombination density is lower in specific regions. We have clarified the use of the 1.5 LOD interval in the Results section.
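
      For readers unfamiliar with the metric, a LOD-drop support interval can be sketched as follows. This is a simplified illustration, not the authors' pipeline: the function name and the toy scan values are hypothetical, and the interval is taken as the contiguous run of markers around the peak whose LOD stays within the chosen drop of the maximum.

```python
def lod_support_interval(positions, lods, drop=1.5):
    """Return (left, right) marker positions bounding the LOD-drop interval."""
    if not positions or len(positions) != len(lods):
        raise ValueError("positions and lods must be non-empty and of equal length")

    peak = max(range(len(lods)), key=lods.__getitem__)
    cutoff = lods[peak] - drop

    # Walk outward from the peak while the LOD stays at or above the cutoff.
    left = peak
    while left > 0 and lods[left - 1] >= cutoff:
        left -= 1
    right = peak
    while right < len(lods) - 1 and lods[right + 1] >= cutoff:
        right += 1
    return positions[left], positions[right]


# Toy scan: positions in Mb and LOD scores along one chromosome (made-up values).
positions_mb = [1, 2, 3, 4, 5, 6]
lod_scores = [2.0, 9.5, 10.6, 9.3, 8.0, 3.0]
interval = lod_support_interval(positions_mb, lod_scores)
```

      A flat LOD profile around the peak widens the interval, which is one way the large intervals mentioned above can arise.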

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      In the vast majority of the figures, the text is too small to read.

      We have adjusted the font size in most of the figures.

      Reviewer #2 (Recommendations for the authors):

      (1) There is a need for careful editing of the text and figures to eliminate multiple typographical and other compositional errors. Examples of these include:

      (a) Figure 2E&F lacks identification of Oprm1 as the gene for cis-eQTL studies.

      (b) Figure 2H is fairly uninterpretable given the small font sizes. It should be excluded, put as a supplemental figure, or reconfigured to highlight the most important findings in a more legible manner.

      (c) Figure 4b: columns in the table need to be identified by a header row.

      We thank the reviewer for these comments and have addressed them in the revised version.

      Oprm1 is now labeled in Figures 2E and 2F; Figures 2G and 2H have been moved to the Supplementary Material; and a header row has been added to the table in Figure 4b.

      Reviewer #3 (Recommendations for the authors):

      Abstract

      (1) For the abstract, it might be simpler to name the alleles as "the C57BL/6J allele", etc., since B allele will confuse people unfamiliar with mouse nomenclature.

      It is critical not to confound the organism known as C57BL/6J with the genotype, allele, or haplotype that a mouse happens to inherit. Diverse types of mice inherit reference alleles but they may be only very distantly related to the C57BL/6J strain. And even the C57BL/6J strain is a moving target that accumulates mutations that are not even considered reference. For example, the mutation in Gabra2 of C57BL/6J is a de novo mutation that is not carried by many of the BXD strains, since this mutation happened in JAX foundation stock after the BXDs were first established by Dr. Ben Taylor in the 1970s.

      The convention is to refer to mouse strains by one string and RRID, the abbreviation of that strain by a common code (often B6), and the abbreviation of the allele, genotype, or haplotype by the italic letter B. This has been the recommendation of the Mouse Nomenclature Committee (on which one of the authors has been a member) for well over 50 years.

      (2) I wondered if "also associated with a high B allele" could be reworded somehow; I had to re-read that sentence several times.

      This sentence has been reworded for clarity.

      (3) Parts of the abstract are written in the present tense, but then it switches to past ("we generated" but then "a Bayesian network analysis supports...").

      We have thoroughly revised the abstract. Following standard scientific writing conventions, we now utilize the past tense to describe the specific experimental actions and results of this study. We have maintained the present tense for established biological facts and the broader significance of the findings.

      (4) While the -log(p) values are all impressive, the abstract should indicate what threshold is used for genome-wide significance and how that threshold was obtained.

      We have added the significance threshold to the Abstract.

      (5) Do the details of the MAP kinase cascade need to be explained in the abstract? It feels like a lot of detail for an abstract and represents one of the most speculative aspects of the paper. Maybe just say you identified a possible network, but save the details for the main paper.

      This is a valid suggestion. We removed the specific MAP kinase details from the abstract.

      Introduction

      (1) You could add a sentence explaining why using an LMM (GEMMA) was an improvement over the prior analysis.

      We have added a sentence explaining that GEMMA improves mapping power and better controls for population structure compared to previous methods.

      (2) When mentioning Philips 2010, you could indicate that it identified Oprm1. This might be easier than "In addition to Oprm1" which confused me at first because it had not been mentioned before, so 'in addition' was jarring.

      We have revised the text to state that Philip et al. (2010) originally identified the Oprm1 locus.

      Results

      (1) There are additional instances of the tense switching between past and present in the results section.

      We have standardized the tenses in the Results section.

      (2) "Ostn, Uts2d, Ccdc50, Gm10823, Fgf12, and Mb21d2" - before giving arguments for fgf12, can you clarify if there are coding variants or eQTLs for any of these genes?

      We have added a statement clarifying the coding variants for other genes in this interval and highlighting their eQTL status.

      (3) "a total number of 4,495 high-quality nuclei transcriptomes". Consider removing the word "number".

      Removed.

      (4) "approximately 6 males and 6 females" - could you point the reader to a supplementary table that has the exact number of individuals at the end of this sentence?

      The exact number of mice used for each BXD strain is not recorded in the original publication by Philip et al.; only the mean and maximum were given. We have clarified that 6 is the average.

      (5) "computed using a subset" - please explain how you selected this subset (I assumed LD pruning, but why not be explicit?). How many SNPs/markers were there originally, and how many were retained?

      We have specified that the subset of markers was selected via LD pruning to represent the genetic diversity of the BXDs.

      (6) A few words about how the significant threshold was obtained (permutation?) are needed.

      We have clarified that the significance threshold was obtained through 1,000 permutations.
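
      A permutation-based genome-wide threshold of this kind can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' analysis: it uses simple single-marker regression rather than the GEMMA linear mixed model, and all function names, the genotype coding, and the toy data are hypothetical. Markers are assumed to be polymorphic and the phenotype non-constant.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation of two equal-length sequences with non-zero variance."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def lod(pheno, geno):
    """LOD score for regressing the phenotype on one marker."""
    r = pearson_r(pheno, geno)
    # Clamp to avoid log10(0) when a marker fits perfectly.
    return -(len(pheno) / 2) * math.log10(max(1 - r * r, 1e-12))


def permutation_threshold(pheno, markers, n_perm=1000, alpha=0.05, seed=0):
    """Empirical (1 - alpha) quantile of the genome-wide maximum LOD under the null."""
    rng = random.Random(seed)
    shuffled = list(pheno)
    maxima = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break the genotype-phenotype link
        maxima.append(max(lod(shuffled, g) for g in markers))
    maxima.sort()
    index = min(n_perm - 1, math.ceil((1 - alpha) * n_perm) - 1)
    return maxima[index]


# Toy data: 16 strains, 4 markers coded 0/1 (made-up values).
toy_markers = [[(i >> k) & 1 for i in range(16)] for k in range(4)]
toy_pheno = [float(i % 5) + 0.1 * i for i in range(16)]
threshold = permutation_threshold(toy_pheno, toy_markers, n_perm=200, seed=1)
```

      Taking the maximum across all markers in each permutation is what makes the resulting threshold genome-wide rather than per-marker.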

      (7) Some of the GWAS results are presented for males and females separately (as well as combined). This is not typical, and so maybe a sentence explaining why the authors thought there might be sex specific GWAS results would be warranted.

      The rationale for the sex-specific analysis is provided in the Results section (significant sex differences and a sex-by-strain interaction).

      (8) The correlation between the sexes of 0.68 could be evidence that there are sex-specific genetic effects, but could it also just be due to increased noise as you reduce sample size? What is the confidence interval for that number? Does it include 1? Or 0? If you randomly split the dataset, rather than splitting on the basis of sex, would you obtain higher correlations? The idea of sex differences is interesting, but a bit more work is needed to clarify these concerns.

      The 95% confidence interval for the correlation of 0.68 (0.52–0.79) excludes both 0 and 1. The drop from r ≈ 0.86 at earlier intervals suggests a biological shift rather than noise due to sample size, as n remains constant (approximately 6 per sex per strain) across all time points. This divergence is driven by sex-specific genetic modifiers, such as the Fgf12 locus, which is more than twice as strong in females (LOD 10.6) as in males (LOD 4.3). We have addressed this in the revision.
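
      The response does not state how the confidence interval was computed. One common approach is the Fisher z-transformation, sketched here as an assumption rather than as the authors' method; the function name is hypothetical and the sample size of 64 strains is a placeholder, not a figure from the source.

```python
import math
from statistics import NormalDist


def fisher_ci(r, n, level=0.95):
    """Confidence interval for a Pearson correlation via the Fisher z-transform."""
    if not -1 < r < 1:
        raise ValueError("r must lie strictly between -1 and 1")
    if n <= 3:
        raise ValueError("n must exceed 3")
    z = math.atanh(r)                  # transform to an approximately normal scale
    se = 1 / math.sqrt(n - 3)          # standard error on the z scale
    crit = NormalDist().inv_cdf(0.5 + level / 2)
    return math.tanh(z - crit * se), math.tanh(z + crit * se)


# Placeholder sample size; the number of strains is an assumption here.
lower, upper = fisher_ci(0.68, n=64)
```

      An interval that excludes 1 indicates the two sexes are not perfectly correlated, but on its own it does not separate sex-specific genetic effects from measurement noise, which was the reviewer's concern.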

      (9) Maybe I missed it, but how did you determine the threshold for significance for the epistatic interaction? Could you also clearly indicate how many possible cases of epistasis were examined/considered, since that dictates the correction for multiple testing.

      We only tested the interaction between the Fgf12 and the Oprm1 loci.

      (10) "To further examine whether Oprm1 and Fgf12 were co-expressed in the same cells of the NAc," can you first give an indication as to why you looked in NAc versus other brain areas you might have considered?

      We have added a sentence explaining that the NAc was chosen due to its central role in opioid reward and the observed strain differences in dopamine release in this region.

      (11) "...from every cell type conveyed a weak but significant positive correlation (r = 0.08, p = 1.8e-8) between the expression of Oprm1 and Fgf12 (Figure 7e). When we performed Pearson's correlation analysis within each individual cell cluster, only D1-MSN-3 had a significant positive correlation (r = 0.35, p = 6.1e-8, Figure 7f). In contrast, D1-MSN-2 had a significantly weak negative correlation (r = -0.12, p = 0.02, Figure 7g)." Can you explain why these correlations are relevant? What hypothesis are you testing?

      We have clarified that these correlations were used to test the hypothesis that Oprm1 and Fgf12 are co-expressed and potentially co-regulated within the same neuronal subtype to support their epistatic interaction.

      (12) "After the morphine locomotion tests were complete," can you give a specific timepoint? Like, was it exactly 180 minutes after the morphine injection?

      We have specified that naloxone was injected exactly 180 minutes after the morphine injection.

      (13) I appreciate the desire to relate the results of this paper to human GWAS results; however, I don't feel there is much worth discussing beyond the Oprm1 finding. Therefore, I would suggest removing this from the results section and instead just making it a discussion topic. The results presented are clearly the weakest part of this paper, and I personally think it is a shame to end the results section with something that is not very informative. But I suspect the authors may wish to retain this section, and I leave that decision to them and the editor.

      We have retained this section but moved some of the more speculative human data discussion to the Discussion section as suggested.

      Discussion

      (1) Typo "deactivates".

      Corrected to "activates".

      (2) The last sentence in the first paragraph again discusses the comparison to humans; I would remove this.

      That sentence has been condensed.

      (3) "These data indicate that Oprm1 is a strong candidate gene for the Chr 10 locus associated with morphine-induced locomotion response." I would remind them of the eQTL for Oprm1 since this is a key piece of evidence supporting this gene as a candidate.

      We have added a reminder of the overlapping mRNA cis-eQTL for Oprm1.

      (4) "It is likely that differences in morphine-induced dopamine release are involved in the highly variable locomotor responses to morphine across the BXD family." I agree this might be true, but since you have no evidence to support this claim, is it worth mentioning at all?

      We have rephrased this as a hypothesis or cited relevant literature supporting this link in parental strains.

      (5) Could you include a sentence or two about why Philip 2010 didn't find Fgf12? Lack of markers? The difference between an LM and an LMM?

      We have added an explanation that the use of a high-density WGS-based marker set and the LMM (GEMMA) allowed for the detection of this novel locus that was previously missed.

      (6) Section titled "Cell-type specific gene expression in NAc". While this is interesting, you might also want to remind the reader that epistatic interactions do not necessarily require the genes to be expressed in the same cell or for their gene products to physically interact.

      We have added this caveat to the Discussion.

      (7) I think the Bayesian network section is not very strong. For example, they did not compare the results for their two chosen genes to the results they might have obtained if they had chosen other genes from their QTL intervals. My guess is that those other genes might have also produced results that were equally convincing. I'm not asking them to do that, but it reflects the risk of false positive results when taking an approach like this. Nevertheless, I am guessing the authors would prefer to include this section.

      We appreciate the reviewer pointing out this possibility and agree with this concern. We have added a statement acknowledging the risk of false positives in Bayesian modeling in this context and noting that these findings are intended as testable hypotheses.

      Methods

      (1) How were the 2 HS rats selected? I had the impression that Dr. Telese's lab had access to snRNA-seq data from more than 2 HS rats.

      We have clarified that these rats were selected based on their addiction-like behavior phenotypes from a larger cohort.

      (2) I didn't look back, but did the main paper point out that the rats are treated with oxycodone rather than morphine?

      We have clarified this distinction in the Methods section.

    1. eLife Assessment

      This important study investigates how the nervous system adapts to changes in the mechanics of the body, which are altered through a tendon transfer surgery affecting finger extensor and flexor muscles. By measuring task performance, joint kinematics, and muscle activity for several weeks post surgery, the authors provide convincing evidence that monkeys undergo a two-phase adaptation process. First, they adopt a maladaptive strategy to overcome the functional challenges imposed by the surgery, and then revert to a strategy that uses the same patterns of muscle coactivation observed pre-tendon transfer.

    2. Reviewer #1 (Public review):

      Summary:

      Many studies have investigated adaptation to altered sensorimotor mappings or to an altered mechanical environment. This paper asks a different but also important question in motor control and neurorehabilitation: how does the brain adapt to changes in the controlled plant? The authors addressed this question by performing a tendon transfer surgery in two monkeys during which they swapped the tendons flexing and extending the digits. They then monitored changes in task performance, muscle activation and kinematics post-recovery over several months, to assess changes in putative neural strategies.

      Strengths:

      (1) The authors performed complicated tendon transfer experiments to address their question of how the nervous system adapts to changes in the organisation of the neuromusculoskeletal system, and present very interesting data characterising neural (and in one monkey, also behavioural) changes post tendon transfer over several months.

      (2) The fact that the authors had to employ two slightly different tasks (one more artificial, the other more naturalistic) in the two monkeys and yet found qualitatively similar changes across them makes the findings more compelling. After all, these are very challenging experiments!

      (3) The paper is well written, the analyses are sound, and the authors interpret the data appropriately, acknowledging the key limitations.

      Weaknesses:

      None of note.

    3. Reviewer #3 (Public review):

      Summary:

      In this study, Philipp et al. investigate how a monkey learns to compensate for a large, chronic biomechanical perturbation--a tendon transfer surgery, swapping the actions of two muscles that flex and extend the fingers. After performing the surgery and confirming that the muscle actions are swapped, the authors follow the monkeys' performance on grasping tasks over several months. There are several main findings:

      - There is an initial stage of learning (around 60 days), where monkeys simply swap the activation timing of their flexors and extensors during the grasp task to compensate for the two swapped muscles.

      - This is (seemingly paradoxically) followed by a stage where muscle activation timing returns almost to what it was pre-surgery, suggesting that monkeys suddenly swap to a new strategy that is better than the simple swap.

      - Muscle synergies seem remarkably stable through the entire learning course, indicating that monkeys do not fractionate their muscle control to swap the activations of only the two transferred muscles.

      - Muscle synergy activation shows a similar learning course, where the flexion synergy and extension synergy activations are temporarily swapped in the first learning stage and then revert to pre-surgery timing in the second learning stage.

      - The second phase of learning seems to arise from making new, compensatory movements (supported by other muscle synergies) that get around the problem of swapped tendons.

      Strengths:

      This study is quite remarkable in scope, studying two monkeys over a period of months after a difficult tendon-transfer surgery. As the authors point out, this kind of perturbation is an excellent testbed for the kind of long-term learning that one might observe in a patient after stroke or injury, and provides unique benefits over more temporary perturbations like visuomotor transformations and over studying learning through development. Moreover, while the two-stage learning course makes sense, I found the details to be genuinely surprising--specifically the fact that: 1) muscle synergies continue to be stable for months after the surgery, despite being maladaptive; and 2) muscle activation timing reverts to pre-surgery levels by the end of the learning course. These two facts together initially make it seem like the monkey simply ignores the new biomechanics by the end of the learning course, but the authors do well to explain that this is mainly because the monkeys develop a new kind of movement to circumvent the surgical manipulation.

      I found these results fascinating, especially in comparison to some recent work in motor cortex, showing that a monkey may be able to break correlations between the activities of motor cortical neurons, but only after several days of coaching and training (Oby et al. PNAS 2019). Even then, it seemed like the monkey was not fully breaking correlations but rather pushing existing correlations harder to succeed at the virtual task (a brain-computer interface with perturbed control).

      Weaknesses:

      I found the analysis to be reasonably well considered and relatively thorough. The authors have also suitably addressed my comments on the previous version. One minor weakness that remains (understandably so) is that the two animals in the study performed different tasks, and the results of the secondary synergy analysis seem to be quite different (Figure 10). That said, I don't think this weakness reduces the impact of the study, and though multiple replications of the same results would provide more convincing evidence, I don't think it's necessary to make the points that the authors are making.

    4. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      (1) I think this is an important paper, but I’m puzzled about a tension in the results. On the one hand, it looks like the behavioural gains post-TT happen rather smoothly over time (Figure 5). On the other hand, muscle synergy activations change abruptly at specific days (around day ~65 for Monkey A and around day ~45 for Monkey B; e.g., Figure 6). How do the authors reconcile this tension? In other words, how do they think that this drastic behavioural transition can arise from what appears to be step-by-step, continuous changes in muscle coordination? Is it “just” subtle changes in movements/posture exploiting the mechanical coupling between wrist and finger movements, combined with subtle changes in synergies, and they just happen to all kick in at the same time? This feels to me to be the core of the paper and should be addressed more directly.

      We thank the reviewer for this insightful comment, as it touches upon the central finding of our study. The apparent tension between the smooth behavioral recovery and the abrupt shift in neural strategy is indeed a key feature of the adaptation process. We propose that this reflects the interaction of two distinct, parallel processes operating on different timescales:

      A slow, gradual skill-learning process, where the monkeys incrementally developed and refined a compensatory motor strategy (i.e., the tenodesis effect). This slow refinement is responsible for the smooth improvement seen in the behavioral metrics over many weeks.

      A fast, switch-like adaptive process, which governs the activation of the primary muscle synergies. The initial ‘swap’ strategy, while simple, was biomechanically conflicting and inefficient. The CNS only abandoned this flawed strategy abruptly once the slow learning process had rendered the new compensatory strategy “good enough” to be a viable alternative.

      Therefore, the abrupt neural shift does not cause the behavioral improvement but is rather enabled by the gradual, underlying development of a better motor solution. To address this important point more directly within the manuscript, we added a new subheading to the Discussion section. This section is dedicated to explicitly framing our findings within this multi-timescale learning model, ensuring the link between the gradual behavioral recovery and the abrupt neural shift is clearly articulated.

      (2) The muscle synergy analyses, which are an important part of the paper, could be improved. In particular:

      (a) When measuring the cross-correlation between the activation of synergies, the authors should include error bars and should also look at the lag between the signals.

      We thank the reviewer for these excellent suggestions to improve our analysis.

      Error Bars: We agree that showing trial-to-trial variability is important. In our revision, we have added a shaded envelope (representing the SD across trials) to the cross-correlation plots in Figures 6, 9 and 10.

      Time Lag: We have performed the cross-correlation analysis allowing for variable time lags and extracted the lag yielding the maximum correlation coefficient (max CC) for each session, in addition to the zero-lag correlation presented in the main figures. As hypothesized, allowing variable lags often resulted in high max CC values throughout the adaptation period, potentially obscuring the clear swap-and-revert pattern visible in the zero-lag analysis. This is likely because the primary adaptation involved changes in synergy timing rather than fundamental shape. However, the analysis of the lag itself proved informative. We observed significant fluctuations in the optimal lag during the early and mid-adaptation phases, particularly around the time of the ‘switch-back’, before the lag stabilized closer to zero in the late phase.

      We have added a description of this analysis to the Methods section. The results of the lag analysis are now presented in a new Supplementary Figure S6 and S7, and a sentence summarizing this finding has been added to the Results section.
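
      The lag analysis described in this response can be illustrated with a minimal sketch. This is not the authors' code: the function names (`pearson`, `lagged_cc`), the lag window, and the toy signals are assumptions made for illustration. The sketch computes the Pearson correlation between two activation profiles at each integer lag, then reports the zero-lag value alongside the lag that yields the maximum correlation.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def lagged_cc(x, y, max_lag):
    """Correlate x[t] with y[t + lag] for every lag in [-max_lag, max_lag].

    Only the overlapping samples are used at each lag.
    Returns (zero_lag_cc, best_lag, max_cc).
    """
    n = len(x)
    results = {}
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            xs, ys = x[:n - lag], y[lag:]
        else:
            xs, ys = x[-lag:], y[:n + lag]
        results[lag] = pearson(xs, ys)
    best_lag = max(results, key=results.get)
    return results[0], best_lag, results[best_lag]

# Hypothetical profiles: y is the same bump as x, delayed by 5 samples.
x = [math.exp(-((t - 40) / 8.0) ** 2) for t in range(100)]
y = [math.exp(-((t - 45) / 8.0) ** 2) for t in range(100)]
zero_cc, best_lag, max_cc = lagged_cc(x, y, max_lag=20)
```

      In this toy case the maximum correlation stays high even though the two profiles are shifted in time, which is the effect the response describes: a pure timing change lowers the zero-lag value but is absorbed by the lag search.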

      (b) Figure 7C and related figures, the authors state that the activation of muscle synergies reverts to pre-TT patterns toward the end of the experiments. However, there are noticeable differences for both monkeys (at the end of the “task range” for synergy B for monkey A, and around 50% task range for synergy B for monkey B). The authors should measure this, e.g., by quantifying the per-sample correlation between pre-TT and post-TT activation amplitudes. Same for Figures 8I, J, etc.

      We thank the reviewer for this detailed and insightful suggestion. We agree that our use of the term ‘reversion’ should be nuanced, as the recovery of the synergy activation patterns is substantial but not perfect.

      To formally quantify these remaining differences, we performed a rigorous quantitative comparison between the pre-surgery and final-day post-surgery activation profiles. We calculated the Cosine Similarity to assess the recovery of the temporal shape, and used a Permutation Test (n=10,000) to test for statistical distinctness between the pre- and post-surgery trajectories.

      Results: We found that while the temporal shapes were highly similar (Cosine Similarity > 0.90 for all synergies), the Permutation Test confirmed that the profiles remained statistically distinct (p < 0.0001) in both animals.

      We have added this quantification to the text (Results). This confirms our nuanced interpretation: while the primary temporal features of the synergies reverted, the recovered motor program represents a novel, ‘good enough’ solution that is robust and functional, rather than a mathematically perfect restoration of the original baseline.
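
      As a rough illustration of the comparison described in this response, the sketch below computes the cosine similarity between two mean activation profiles and runs a label-shuffling permutation test on the trial-level data. The response does not specify the test statistic, so the one used here (sum of squared differences between mean profiles) is an assumption, as are all function names and the simulated data.

```python
import math
import random

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_profile(trials):
    """Average a list of trials (each a list of samples) sample by sample."""
    return [sum(col) / len(trials) for col in zip(*trials)]

def profile_distance(a_trials, b_trials):
    """Assumed test statistic: squared distance between the mean profiles."""
    ma, mb = mean_profile(a_trials), mean_profile(b_trials)
    return sum((p - q) ** 2 for p, q in zip(ma, mb))

def permutation_test(a_trials, b_trials, n_perm=10000, seed=0):
    """Shuffle trial labels and count how often the statistic is as extreme."""
    rng = random.Random(seed)
    observed = profile_distance(a_trials, b_trials)
    pooled = list(a_trials) + list(b_trials)
    n_a = len(a_trials)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if profile_distance(pooled[:n_a], pooled[n_a:]) >= observed:
            count += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (count + 1) / (n_perm + 1)

# Simulated data: same temporal shape, but post-surgery amplitude is 30% larger.
rng = random.Random(1)
shape = [math.exp(-((t - 15) / 5.0) ** 2) for t in range(30)]
pre = [[s + rng.gauss(0, 0.05) for s in shape] for _ in range(20)]
post = [[1.3 * s + rng.gauss(0, 0.05) for s in shape] for _ in range(20)]

similarity = cosine_similarity(mean_profile(pre), mean_profile(post))
p_value = permutation_test(pre, post, n_perm=2000)
```

      The two measures answer different questions, which is why both can hold at once: cosine similarity ignores overall scale and captures shape, whereas the permutation test is sensitive to any consistent difference between the two sets of trials.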

      (c) In Figures 9 and 10, the authors show the cross-correlation of the activation coefficients of different synergies; the authors should also look at the correlation between activation profiles because it provides additional information.

      We thank the reviewer for this comment and the opportunity to clarify our terminology. We agree that analyzing the correlation between the full activation profiles is the most informative approach. In our manuscript, the terms ‘activation coefficients’ and ‘activation profiles’ both refer to the complete, time-varying activation patterns of the muscle synergies. Therefore, the cross-correlation analysis presented in Figures 9 and 10 is indeed the correlation between these full activation profiles. To prevent any potential ambiguity for future readers, we have revised the manuscript to use the term ‘activation profiles’ exclusively and consistently when referring to these time-varying synergy activations.

      (d) The muscle synergy analysis for Monkey B is hindered by the fact that the authors lost the ability to record from the (very) functionally relevant FDS muscle. I’d repeat the synergy analyses without this muscle to understand to what extent the observed changes with respect to baseline are driven by the lack of this data.

      We thank the reviewer for raising this important methodological point. We agree that controlling for changes in the recorded muscle set is crucial for a valid comparison between pre- and post-surgical synergy structures. The reviewer’s concern is based on the premise that the FDS muscle was included in the pre-surgical analysis for Monkey B but absent from the post-surgical analysis.

      We would like to clarify that this is not the case. Due to the loss of the FDS signal post-surgery, we made the deliberate decision to exclude the FDS muscle from ALL synergy analyses for Monkey B, including the pre-surgical baseline period. This was done for the precise reason the reviewer identifies: to ensure a direct and unbiased “apples-to-apples” comparison and to avoid introducing the lack of this muscle as a confound. Therefore, the changes in synergy structure that we report for Monkey B can be confidently attributed to genuine physiological adaptation rather than an artifact of a changing input dataset.

      (e) Figure 11: The authors talk about a key difference in how Synergy B (the extensor finger) evolved between monkeys post-TT. However, to me this figure feels more like a difference in quantity (the time course) than in quality, since for both monkeys the aaEMG levels pretty much go back to close to baseline levels, even if there’s a statistically significant difference only for Monkey B. What am I missing?

      We thank the reviewer for this insightful question, as it has prompted us to refine our interpretation of this key finding. The reviewer correctly notes that the recovery trajectories of Synergy B appear different, and we agree that our original explanation can be improved.

      A more parsimonious interpretation, and one that we believe aligns better with the data, is that both monkeys likely underwent a similar ‘arms race’, but we captured different phases of this process. In Monkey A, our recordings (starting Day 29) captured the escalating phase of this neuromuscular conflict. In contrast, for Monkey B, recordings began on Day 20, by which time this rapid escalation had likely already occurred and peaked. This difference in the timing of the ‘arms race’ is consistent with our behavioral observations; Monkey A struggled for a longer period before performing the task proficiently, suggesting a more protracted overall adaptation process. Thus, the apparent difference in the figures is likely a reflection of the observational window and the individual adaptation rate of each animal, rather than a fundamental qualitative difference in their adaptive strategy. We have revised the text to present this more unified and coherent interpretation.

      (f) Lines 408-09 and above: The authors claim that “The development of a compensatory strategy, primarily involving the wrist flexor synergy (Synergy C), appears crucial for enabling the final phase of adaptation”, which feels true intuitively and also based on the analysis in Figure 8, but Figure 11 suggests this is only true for Monkey B. How can these statements be reconciled?

      We believe the reviewer may be referring to Monkey A in their comment, as the strong compensatory effect is indeed seen in this animal. The core of this issue, which we have clarified in our revision, is that both monkeys developed a compensatory tenodesis grasp but used different neural strategies to achieve it.

      For Monkey A, strong evidence for this strategy is provided by a clear temporal shift in the activation of its dedicated wrist flexor synergy (Synergy C). As we have now clarified in the manuscript, the peak of this synergy’s activation moved from occurring just after object contact to just before it, a re-timing well-suited to enable a tenodesis grasp.

      For Monkey B, the strategy was one of subtle re-timing rather than scaling. While the total aggregated activation of its primary flexor synergy (Synergy A) did not significantly increase, its temporal profile shifted. Specifically, activation prior to object contact increased, providing the necessary wrist flexion for its assistive tenodesis grasp, which was kinematically confirmed in Figure 12. This was achieved by reallocating activation from the post-contact phase, resulting in an earlier activation peak for the synergy overall. Crucially, a finer-grained analysis reveals a precise temporal sequence within this synergy’s activation: the wrist flexor component (PL) consistently peaked just before object contact to enable hand opening, while the finger flexor component (FDP) peaked just after contact to secure the grasp.

      This timing resolves the apparent biomechanical conflict. It also reveals that while both monkeys converged on the same biomechanical solution (a tenodesis grasp), the observable neural implementation appeared different. However, we must be cautious in directly comparing the computed synergy structures themselves, as the analysis for Monkey B was performed without the FDS muscle. The apparent “multi-functional synergy” in Monkey B is most likely a consequence of this missing data. What is clear and robust, however, is that both monkeys converged on a remarkably similar temporal solution: they both learned to re-time the activation of their key wrist flexor muscles to the pre-grasp phase.

      In Monkey A, this was observed in the temporal shift of its dedicated wrist flexor synergy (Synergy C). In Monkey B, this was observed in the temporal shift of the Palmaris Longus (PL) muscle itself (which, in our computed synergies, was grouped into Synergy A). This convergence on an identical temporal adaptation, regardless of the computed modular organization, is the key finding. We have revised the manuscript to articulate this more precise and defensible interpretation.

      (3) Experimental design: at least for the monkey who was trained on the “artificial task” (Monkey A), it would have been good if the authors had also tested him on naturalistic grasping, like the second monkey, to see to what extent the neural changes generalise across behaviours or are task-specific. Do the authors have some data that could be used to assess this even if less systematically?

      We thank the reviewer for raising this important point regarding the generalizability of our findings across different behaviors. We fully agree that a direct comparison of both tasks in the same animal would have been a valuable experiment. Unfortunately, we do not have systematic data on naturalistic grasping for Monkey A that would allow for such a direct comparison. We therefore view the two tasks as providing complementary evidence. Monkey A’s data shows the adaptation process during a highly stereotyped behavior, while Monkey B’s data demonstrates that a similar two-phase adaptive process occurs during a more naturalistic, unconstrained task. The convergence of these findings strengthens our overall conclusion that this multi-timescale adaptation is a robust principle of motor learning. Nonetheless, the reviewer raises a fascinating question about the task-specific tuning of motor synergies, which remains an excellent direction for future studies.

      (4) Monkey B’s behaviour pre-tendon transfer seems more variable than that of Monkey A (e.g., the larger error bars in Figure 5 compared to monkey A, the fluctuating cross-correlation between FDS pre and EDC post in Figure 6Q). This should be quantified to better ground the results since it also shows more variability post-TT.

      We thank the reviewer for this excellent suggestion to formally quantify the pre-surgery behavioral variability. We have performed the suggested analysis on the "Grip Formation Time" metric (Fig. 5A), which was the comparable metric between the two tasks. Our calculation of the Coefficient of Variation (CV) confirms the reviewer’s observation. Monkey B’s pre-surgery performance was substantially more variable (CV = 81.93%) than Monkey A’s (CV = 46.62%). Furthermore, a non-parametric test for equal variances (Ansari-Bradley test) confirmed that this difference is highly statistically significant (p < 0.0001). We have added a description of this analysis to the Methods and reported this finding in the Results section to provide a clearer context for the baseline differences between the subjects.
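
      For readers unfamiliar with the metric, the Coefficient of Variation is the standard deviation expressed as a percentage of the mean. The sketch below shows the calculation on made-up values; the numbers are hypothetical and are not the study's data, and the use of the sample (n-1) standard deviation is an assumption. The Ansari-Bradley test itself is not reproduced here; one available implementation is `scipy.stats.ansari`, though the response does not say which software the authors used.

```python
import statistics

def coefficient_of_variation(values):
    """CV in percent: sample standard deviation divided by the mean."""
    return 100.0 * statistics.stdev(values) / statistics.mean(values)

# Hypothetical pre-surgery grip formation times in seconds (illustrative only).
monkey_a = [0.40, 0.55, 0.35, 0.60, 0.50]
monkey_b = [0.20, 0.90, 0.45, 1.10, 0.35]

cv_a = coefficient_of_variation(monkey_a)
cv_b = coefficient_of_variation(monkey_b)
```

      Because the CV is normalized by the mean, it allows variability to be compared between animals whose typical grip formation times differ.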

      (5) Minor: Figure 12 is interesting and supports the idea that monkeys may exploit the biomechanical coupling between wrist and fingers as part of their functional recovery. It would be interesting to measure whether there is a change in such coupling (tenodesis) over time, e.g., by plotting the change in wrist angle vs change in MCP angle as a scatter plot (one dot per trial), and in the same plot show all the days, colour coded by day. Would the relationship remain largely constant or fluctuate slightly early on? I feel this analysis could also help address my point (1) above.

      We thank the reviewer for this excellent and insightful suggestion. We have performed the suggested analysis for Monkey B, plotting the trial-by-trial relationship between wrist and MCP angles for all recording days (New Figure 13).

      The results clearly show the gradual refinement of the tenodesis coupling. Pre-surgery, there was no correlation (R²=0.00). Immediately post-surgery (Day 22), the relationship was weak and variable (R²=0.16), reflecting an exploratory phase. Over the following weeks, the coupling became progressively stronger and more consistent, with the R² value peaking at 0.58 around Day 56, indicating a robust exploitation of the new strategy. The relationship then stabilized at a moderate level (R² ~0.2-0.3) in the final days. This analysis provides direct kinematic evidence for the slow, gradual skill-learning component of our two-state model. It beautifully complements our response to the reviewer’s first point by visualizing the underlying refinement process that occurred concurrently with the more abrupt neural shifts. We have added this new figure and a description of these results to the manuscript.
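
      The per-day coupling measure reported in this response can be sketched as follows, assuming R² is taken from a simple linear fit of MCP angle change on wrist angle change across trials (the response does not state the fitting procedure). The function name, the day labels, and the angle values are hypothetical.

```python
def r_squared(x, y):
    """R^2 of a simple linear fit y ~ a + b*x.

    For a one-predictor linear regression this equals the squared
    Pearson correlation between x and y.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # no variance in one variable: no measurable coupling
    return (sxy * sxy) / (sxx * syy)

# Hypothetical per-trial angle changes in degrees: (wrist, MCP) for each day.
trials_by_day = {
    "pre": ([-1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, -1.0]),    # uncoupled
    "day_56": ([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0]),   # tightly coupled
}
r2_by_day = {day: r_squared(w, m) for day, (w, m) in trials_by_day.items()}
```

      Tracking this single number per recording day gives the kind of time course described in the response, with low values when wrist and finger movements are unrelated and higher values as the coupling is exploited.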

      Reviewer #2 (Public review):

      Weaknesses:

      The most notable weakness of the study is the incompleteness of the data. [...] As a result, it is difficult to make general conclusions from the study, and it awaits further analysis or the addition of another subject.

      We thank the reviewer for this critical and accurate assessment of the study’s limitations. The reviewer is correct that the datasets for the two monkeys are incomplete in different ways and that the tasks were not identical. We fully acknowledge these limitations throughout the manuscript. Rather than viewing these differences as a weakness that prevents generalization, we propose that they offer a unique strength in the form of complementary evidence. We consider the two animals not as a direct replication, but as two distinct case studies that test the same underlying hypothesis under different conditions.

      Monkey A, with its high-quality EMG and highly stereotyped task, provides a detailed, quantitative view of the neural adaptation process, allowing us to precisely characterize phenomena like the ‘neuromuscular arms race’.

      Monkey B, with its kinematic data and more naturalistic task, provides crucial evidence that the same fundamental principles, a two-phase adaptation and the eventual development of a compensatory strategy, generalize to a less constrained, more behaviorally relevant context. We believe the key finding is the convergence of the results. Despite the differences in individual strategy, task demands, and available data, both animals demonstrated the same core "swap-and-revert" adaptive process. We propose that this convergence from heterogeneous sources lends support to the generalizability of our conclusions, suggesting that the multi-timescale adaptation we describe may be a general feature of motor learning following such perturbations. We agree that future studies with more subjects are needed to fully establish this principle. Nonetheless, we feel that the convergent evidence from these two complementary cases provides a valuable foundation for the model we present.

      A second weakness is the insufficient analysis of the movements themselves, particularly for Monkey A. [...] Since the authors have video data for both monkeys, it is surprising that it was not used to extract landmarks for kinematic analysis, or at least hand/endpoint trajectory, and how it is adjusted over time. Adding more behavior data and aligning it with the EMG data would be very helpful for characterizing motor recovery and is needed to support conclusions about underlying neural control strategies for functional improvement.

      We thank the reviewer for this important suggestion. The reviewer’s comment prompted us to re-examine our behavioral data, and we have now performed additional analyses that we agree provide a much clearer link between the neural changes and functional recovery.

      For Monkey A, we have quantified the ‘pull times’ on a day-by-day basis. This analysis reveals a clear, gradual learning curve: pull times were initially long and variable post-surgery but steadily decreased and stabilized over the recovery period. This provides a direct, quantitative measure of motor performance recovery for this animal.

      For Monkey B, we have performed a detailed analysis of the ‘grasp aperture’ prior to object contact. This kinematic analysis is particularly revealing, as it shows the development of the compensatory strategy in real-time. The grasp aperture was initially very small post-surgery, reflecting the monkey’s inability to open its hand. It then steadily increased over the next ~40 days as the monkey learned and refined the compensatory tenodesis grasp, before stabilizing at a new, functional baseline.

      We believe these new analyses directly address the reviewer’s concern by providing a more detailed picture of motor recovery. The grasp aperture data, in particular, offers a clear kinematic correlate for the slow, skill-learning process that we propose runs in parallel to the more abrupt neural reorganization. We have added these results as a new figure in the main text of our revised manuscript.

      Considering specific conclusions, the statement that the monkeys learned to use “tenodesis” over time by increasing activation of a wrist flexor muscle synergy does not seem to be fully supported by the data. [...] Given these issues, it is not clear how to align the EMG and kinematic data and interpret these findings.

      We thank the reviewer for this detailed and critical analysis. They raise an excellent point and have correctly observed that the adaptation is not a simple, uniform increase in wrist flexor synergy amplitude. Our interpretation, which we have clarified in the manuscript, is that the monkeys learned a more sophisticated strategy: a precise re-timing of the wrist flexor activation to occur earlier in the movement, specifically to pre-shape the hand for the grasp.

      For Monkey A: The reviewer correctly notes that the peak amplitude of Synergy C (the wrist flexor synergy) around the moment of grasp (0% task range) is lower in the final phase compared to baseline. However, the crucial change is temporal: the peak of this synergy’s activation shifts from occurring just after the grasp (~+1%) to occurring just before it (~-2%). This re-timing is perfectly suited to enable finger extension via the tenodesis effect immediately prior to object contact. The subsequent lower amplitude may reflect a more efficient, less forceful movement once this new skill was refined.

      For Monkey B: The reviewer is right that this monkey does not have a dedicated wrist flexor synergy and that the overall amplitude of the PL muscle does not increase dramatically. However, a closer look at its activity profile (Fig. S2-AN) reveals a clear and consistent increase in activation specifically in the pre-contact phase (~7% task range). This is the precise neural signature of the assistive tenodesis grasp that is kinematically confirmed in Figure 12. The monkey is not simply scaling up the synergy; it is strategically activating it earlier to prepare for the grasp.

      In summary, the key evidence linking the EMG to the tenodesis strategy is in the temporal domain. The learned re-timing of the wrist flexor activation to the pre-grasp phase is the crucial link that aligns the neural and kinematic data. We have revised the manuscript to make this distinction between amplitude scaling and temporal shifting clearer.
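
      The distinction between amplitude scaling and temporal shifting can be made concrete with a small sketch. It locates the peak of an activation profile on the task-range axis (0% being object contact) and compares peak timing and peak amplitude between two phases. The profiles below are invented for illustration, and the function name is hypothetical.

```python
def peak_time(task_range, activation):
    """Return the task-range value at which the activation is maximal."""
    idx = max(range(len(activation)), key=activation.__getitem__)
    return task_range[idx]

# Task range in percent, with 0 marking object contact (hypothetical sampling).
task_range = [-5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5]

# Invented wrist flexor profiles: baseline peaks just after contact,
# the late phase peaks just before contact and with a lower amplitude.
baseline = [0.1, 0.1, 0.2, 0.3, 0.5, 0.8, 1.0, 0.7, 0.4, 0.2, 0.1]
late = [0.1, 0.2, 0.5, 0.8, 0.6, 0.4, 0.3, 0.2, 0.1, 0.1, 0.1]

timing_shift = peak_time(task_range, late) - peak_time(task_range, baseline)
amplitude_change = max(late) - max(baseline)
```

      Here the peak moves earlier while the peak amplitude falls, so an analysis restricted to amplitude would miss the change that the response identifies as the link to the tenodesis strategy.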

      A more minor point regarding conclusions: statements about poor task performance and high energy expenditure being the costs that drive exploration for a new strategy are speculative and should be presented as such. Although the monkeys did take longer to complete the tasks after the surgery, they were still able to perform it successfully and in less than a second and no measurements of energy expenditure were taken.

      We thank the reviewer for this important point regarding the precision of our language. We agree that statements regarding ‘high energy expenditure’ and the specific drivers for exploring a new strategy are interpretations of the data, not direct measurements, and should be framed as such.

      Our speculation about energetic cost is based on the significant increase in muscle co-activation we observed (e.g., Fig. 11), a phenomenon widely understood to be metabolically expensive. Similarly, while the monkeys were still successful, their prolonged movement times and inefficient motor patterns represent a clear performance deficit compared to their highly optimized pre-surgical baseline, which we propose acted as a driver for further adaptation. In our full revision, we have carefully revised the manuscript to soften these claims. We have used more speculative language, such as “we hypothesize that...”, “the likely cost of...”, or “may have provided the impetus for...” to ensure that our interpretations are clearly distinguished from our direct empirical findings.

      A small concern is whether the tendon transfer effect may fail over time, either due to scar tissue formation or tendon tearing, and it would be ideal if the integrity of the intervention were re-assessed at the end of the study.

      We thank the reviewer for raising this important point regarding the long-term integrity of the tendon transfer. We agree that a terminal anatomical re-assessment would be an ideal control. While a terminal assessment was not performed as part of this study’s protocol, we were able to monitor the transfer’s integrity throughout the study. We are confident the transfer remained functionally intact for two key reasons:

      (1) Physical Monitoring: We periodically used ultrasound imaging to non-invasively visualize the tendon repair, which allowed us to confirm its continued physical integrity.

      (2) Functional Evidence: This physical confirmation was corroborated by the functional data. Both animals achieved stable, proficient task performance that was maintained for months. Furthermore, the late-phase neuromuscular control strategies became highly consistent. A significant failure, such as a tendon tear or prohibitive mechanical scarring, would be incompatible with this sustained behavioral and neural stability.

      Nevertheless, we agree that a terminal assessment is an excellent methodological suggestion that should be incorporated into the design of future long-term studies of this nature.

      Reviewer #3 (Public review):

      (1) First, I find myself wondering about the physical healing process from the tendon transfer surgery and how it might contribute to the learning. Specifically, how long does it take for the tendons to heal and bear forces? If this itself takes a few months, it would be nice to see some discussion of this.

      We thank the reviewer for this insightful question about the potential contribution of the physical healing process to the adaptation timeline. Our surgical protocol was specifically designed to ensure the tendon transfer was biomechanically robust from the outset, minimizing the role of healing as a rate-limiting factor.

      We used a Pulvertaft weave technique, which is known to achieve mechanical strength equivalent to that of a native tendon shortly after the procedure (Graham et al., 2023). The repair involved more than two weaves and utilized high-strength suture material to maximize its initial force-bearing capacity. While full fibrous integration around the suture site typically occurs within approximately six weeks, the repair itself was strong enough to bear physiological forces immediately post-surgery. Therefore, the prolonged, complex, two-phase multi-month behavioral recovery and the neural reorganization we observed cannot be attributed to a slow physical healing process. Instead, this supports our conclusion that the observed timeline reflects the challenges and constraints of a purely neural adaptation and skill-learning process. To make this crucial point clear to all readers, we have added these details about the surgical method to the Methods section and included a brief discussion of its implications in the Discussion.

      (2) Second, I see that there are some changes in the muscle loadings for each synergy over the days, though they are relatively small. The authors mention that the cosine distances are very small for the conserved synergies compared to distances across synergies, but it would be good to get a sense for how variable this measure is within synergy. For example, what is the cosine similarity for a conserved synergy across different pre-surgery days? This might help inform whether the changes post-surgery are within a normal variation or whether they reflect important changes in how the muscles are being used over time.

      We thank the reviewer for this excellent and insightful suggestion. Establishing a baseline for normal day-to-day variability is an important control for our synergy analysis.

      We have performed this analysis in full. Specifically, to quantify baseline stability, we calculated the cosine similarity between the spatial synergy weights (W) of each individual recording day and the pre-surgery average. This provides a rigorous measure of day-to-day variability relative to the stable baseline structure. We have added these data to Figure 7 (Panel I), which plots the pre-surgery similarity (blue traces) alongside the post-surgery adaptation (red traces).

      We found that baseline stability was remarkably high, with cosine similarity consistently exceeding 0.99 (e.g., Monkey A: 0.99 ± 0.001). This quantification allows the reader to formally assess that the changes observed post-surgery (e.g., drops to ~0.80 or ~0.60 in Monkey B) are well outside the range of normal physiological fluctuation, representing subtle but genuine structural adaptation.
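
      The baseline-stability measure described in this response can be sketched as below: each day's synergy weight vector is compared, by cosine similarity, with the average of the pre-surgery vectors. The weight values, day labels, and function names are hypothetical and serve only to show the calculation.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity_to_baseline(pre_days, all_days):
    """Compare each day's weight vector with the pre-surgery average.

    pre_days, all_days: dicts mapping a day label to one synergy's
    muscle-weight vector (one entry per recorded muscle).
    """
    n = len(pre_days)
    baseline = [sum(col) / n for col in zip(*pre_days.values())]
    return {day: cosine_similarity(w, baseline) for day, w in all_days.items()}

# Hypothetical weights for one synergy across four muscles.
pre_days = {
    "pre_1": [0.80, 0.50, 0.20, 0.10],
    "pre_2": [0.82, 0.48, 0.21, 0.10],
    "pre_3": [0.79, 0.51, 0.19, 0.11],
}
all_days = dict(pre_days)
all_days["post_30"] = [0.30, 0.40, 0.80, 0.30]  # invented altered structure

sims = similarity_to_baseline(pre_days, all_days)
```

      The spread of the pre-surgery values sets the range of normal day-to-day fluctuation, against which any post-surgery drop can be judged.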

      (3) Last, and maybe most difficult (and possibly out of scope for this work): I would have ideally liked to see some theoretical modeling of the biomechanics so I could more easily understand what the tendon transfer did or how specific synergies affect hand kinematics before and after the surgery. Especially given that the synergies remained consistent, such an analysis could be highly instructive for a reader or to suggest future perturbations to further probe the effects of tendon transfer on long-term learning.

      We thank the reviewer for this excellent and forward-thinking suggestion. We completely agree that a detailed biomechanical model of the tendon transfer would be a powerful tool for understanding the mechanical consequences of the surgery and for interpreting the function of the recorded muscle synergies. However, creating a subject-specific musculoskeletal model with the fidelity required to accurately simulate synergy-to-kinematic transformations is a highly complex project that we feel is well beyond the scope of the current manuscript. Such an endeavor would constitute a major research project in its own right.

      Our study’s primary focus was to provide a detailed, longitudinal characterization of the in-vivo neural adaptation following this perturbation, a dataset that is itself rare and valuable. We aimed to document the physiological learning process as it unfolded over many months. Nonetheless, the reviewer’s point is exceptionally well-taken. Currently, we are constructing a monkey musculoskeletal model and performing tendon transfer on this model to investigate what kind of characteristics in the learning process reproduce the synergy changes observed in the experiments. Although this project is still in progress, to date, we have demonstrated that the robustness of synergies themselves is necessary for changes in muscle activity at the synergy level (Nakajima N, Wang S, Ogihara N, Oya T, Seki K, Funato T, Upper Limb Musculoskeletal Model of Macaque Monkey for Approaching Adaptation Mechanism to Tendon Transfer, Society for Neuroscience 2023, Washington DC, USA, 2023).

      The rich dataset we have collected in the present research could serve as an excellent foundation for developing and validating such a model in the future. We believe that combining these two approaches is a critical and exciting next step for the field, and we have highlighted this as a key future direction in our discussion.

      Recommendations for the authors:

      Reviewing Editor Comments:

      When revising the manuscript for resubmission, please try to improve the visual presentation of the data, which is a point highlighted by all three reviewers during the discussion, including making the presentation of monkey-specific results more consistent across subjects.

      We have comprehensively revised the figures to ensure a consistent and clear visual presentation, as requested. Specifically, we standardized the layout across all main and supplementary figures (placing Monkey A consistently in the top rows or left columns and Monkey B in the bottom rows or right columns) and applied unified color schemes throughout the manuscript. Furthermore, we harmonized the presentation of the analytical results, such as the specific cross-correlation pairings in Figures 9 and 10, to ensure that the data for both subjects are presented with identical logic, facilitating direct comparison.

      Reviewer #1 (Recommendations for the authors):

      (1) Please revise the writing; some words are missing (line 90), and some sentences could be clarified slightly, even if the paper is well written (lines 317-320). The paragraph including the idea of tenodesis could also be further clarified, I think.

      Thank you for pointing these out. We have corrected the missing word (osteoarthritis) on line 90. We have also revised lines 317-320 to remove ambiguity. Furthermore, the section describing the tenodesis effect (now section "Distinct neural implementations...") has been substantially rewritten for improved clarity, incorporating a more detailed explanation of the biomechanics.

      (2) In the Introduction, the authors cite Hunter and Eckstein 2009 and Mercuri and Muntoni 2013 without describing the pathological conditions; this will not be clear for nonspecialists.

      Thank you. We have added brief descriptions ("osteoarthritis, a degenerative joint disease," and "muscular dystrophy, which involves progressive muscle weakness,") directly into the Introduction sentence where these references appear.

      (3) Data presentation: I often thought that the data could be presented more clearly:

      (a) For example, Figure 3D and 4D should show error bars around the mean to have a sense of the consistency of pre-lesion behaviour. Same for other figures like Figure 6.

      We appreciate the reviewer's suggestion to visualize data consistency. (a) Figures 3D, 4D, and 6 (EMG Profiles): For these figures, we opted to display mean traces and peak markers to clearly illustrate the temporal shifts and relationships between muscles. Overlaying multiple standard deviation envelopes in these comparative plots would significantly reduce legibility. However, to fully address the reviewer's request to see the consistency of pre-lesion behavior, we direct attention to Supplementary Figure S1, which presents the complete EMG profiles with full error tubes (Mean ± SD) for every recorded muscle. (b) Quantitative Analysis Figures: We ensured that variability is explicitly visualized in all statistical analyses. The cross-correlation time-courses in Figures 6 (G-Q), 9, and 10 are plotted with shaded error tubes to show variance. Similarly, the aggregated EMG analysis in Figure 11 utilizes bar plots with explicit error bars to quantify the statistical consistency of the changes.

      (b) The autocorrelation analysis in Figure 6 should also include measures of lag if it’s not at zero lag. If it’s the latter, please specify it in the Methods.

      We thank the reviewer for this question regarding the cross-correlation analysis presented in Figure 6 (Panels G-J, P-Q). We confirm that this analysis was performed at zero time lag. To clarify this, we have added a sentence to the Methods section (Subsection "Crosscorrelation analysis") explicitly stating that the EMG cross-correlations shown in Figure 6 were calculated at zero lag. We have also added a clarifying note ("at zero time lag") to the description of these panels within the Figure 6 caption.
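      For readers less familiar with the measure, a zero-lag cross-correlation can be sketched as follows. This is a minimal illustration, assuming the normalized (mean-removed) form, in which the zero-lag value equals the Pearson correlation of the two profiles. The function name `zero_lag_xcorr` and the synthetic Gaussian profiles are hypothetical and are not the code or data behind Figure 6.

```python
import numpy as np

def zero_lag_xcorr(x, y):
    """Normalized cross-correlation of two equal-length signals at lag 0."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.shape != y.shape:
        raise ValueError("signals must have the same length")
    # Remove the mean so the result matches the Pearson correlation.
    xc = x - x.mean()
    yc = y - y.mean()
    denom = np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2))
    if denom == 0:
        return float("nan")  # undefined for a constant signal
    return float(np.sum(xc * yc) / denom)

# Synthetic, time-normalized profiles (illustrative only).
t = np.linspace(0.0, 1.0, 101)
profile_a = np.exp(-((t - 0.4) ** 2) / 0.01)  # burst early in the task
profile_b = np.exp(-((t - 0.7) ** 2) / 0.01)  # burst later in the task

same = zero_lag_xcorr(profile_a, profile_a)     # identical timing
shifted = zero_lag_xcorr(profile_a, profile_b)  # shifted timing
```

      Because no lag search is performed, a pure shift in burst timing lowers the value, which is why the lag convention matters for interpreting the plotted time-courses.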

      (c) Seeing EMG patterns similar to those presented in Figures 3D and 4D at different times post-lesion (e.g., as a Supplementary figure) would also give readers a better intuition of the neural changes.

      We thank the reviewer for this suggestion to provide more intuitive examples of the neural changes. We realize we did not sufficiently highlight this in the main text, but this complete data is already available in the manuscript. Supplementary Figures S1 and S2 provide a comprehensive overview of the EMG patterns for all recorded muscles in Monkey A and Monkey B, respectively. These figures show the pre-surgery and post-surgery average profiles for all recording sessions as well as the average profiles from five different post-surgery landmark days, covering the entire adaptation period. We have added explicit cross-references to these figures in the main text.

      (d) I couldn’t fully understand the analysis in Figure 4E; clarify.

      We thank the reviewer for noticing this oversight. The reviewer is correct that Figure 4E was not referenced in the main text. This panel was intended to show the baseline kinematic profiles (MCP and wrist angles) for Monkey B's control session, corresponding to the average EMGs shown in panel 4D. Given that our more comprehensive kinematic analyses are now presented in Figure 12 and the new Figure 13, we believe panel 4E is largely redundant. To improve the clarity and focus of Figure 4, we have removed panel 4E and its description from the revised manuscript.

      (e) Some figures showing neural changes (e.g., Figures 6G-J, 6P,Q, Figures 9 and 10, and even Figure 11 for different reasons) would become more understandable if they were accompanied by the behavioural changes (e.g., something like Figure 5A on top of them).

      We agree that visualizing the temporal link between neural reorganization and behavioral recovery is essential for interpreting the data. We have implemented this suggestion by overlaying behavioral metrics onto the right y-axes of Figures 6 (G-Q), 9, 10, and 11. However, regarding the specific behavioral metric, we opted to overlay the maladaptive behavior/aberrant reaching metric (from Figure 5B) rather than the grip formation time (Figure 5A). We found that the maladaptive behavior profile provided a clearer and more direct correlate to the neural data, as its peak coincides precisely with the ‘swapped’ synergy phase, thereby effectively illustrating the functional cost of that specific neural state.

      (f) Some figure captions could be improved by adding more detail (e.g., for Figure 6).

      We agree. We have substantially expanded and improved the captions for Figure 6 and Figure 7 to make them more self-contained and guide the reader more effectively through the key findings presented in the panels. We have also reviewed other captions for clarity.

      (g) I’d show the cosine distance between synergies across days as a main figure, e.g., as part of Figure 7, because this is an important result.

      We agree that the longitudinal stability of the synergy structures is a crucial result that deserves prominence. We have implemented this suggestion by adding new panels, Figure 7 (I, K) for primary synergies and Figure 8 (K, L) for secondary synergies, which plot the cosine similarity of the spatial synergy weights across the entire experimental timeline. These panels explicitly visualize the high stability of the pre-surgery baseline (blue traces, similarity > 0.99) and contrast it with the dynamic structural tuning observed during the post-surgery adaptation (red traces), providing a clear, day-by-day account of synergy evolution as requested.
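      The similarity measure itself can be sketched in a few lines. The snippet below is an illustration only: it assumes each day's synergy is compared against a single reference weight vector, which may differ from the reference actually used for the new panels, and the five-muscle weight vectors are made-up values.

```python
import numpy as np

def cosine_similarity(w_ref, w_day):
    """Cosine of the angle between two synergy weight vectors."""
    w_ref = np.asarray(w_ref, dtype=float)
    w_day = np.asarray(w_day, dtype=float)
    norm = np.linalg.norm(w_ref) * np.linalg.norm(w_day)
    if norm == 0:
        raise ValueError("synergy weight vectors must be non-zero")
    return float(np.dot(w_ref, w_day) / norm)

# Hypothetical spatial weights over five muscles (not the recorded data).
baseline = np.array([0.9, 0.7, 0.1, 0.05, 0.2])
weights_by_day = {
    "pre_1": np.array([0.88, 0.72, 0.12, 0.05, 0.21]),  # near-identical structure
    "post_30": np.array([0.5, 0.3, 0.6, 0.4, 0.2]),     # altered structure
}

similarity = {
    day: cosine_similarity(baseline, w) for day, w in weights_by_day.items()
}
```

      Cosine similarity ignores overall scaling of the weights, so it tracks changes in the relative muscle composition of a synergy rather than changes in its amplitude.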

      (h) In Figure 7C, D and G, H, it’d be interesting to also see in the background the EMG for the transferred muscle that belongs to each synergy, to appreciate their relationship.

      We thank the reviewer for this suggestion. To illustrate the close relationship between the primary synergies and their key constituent muscles, while avoiding visual clutter in the complex post-surgery plots, we have modified the pre-surgery panels of Figure 7 (C, D, G, H). In these panels, we have now overlaid the average pre-surgery EMG profile of the primary transferred muscle belonging to that synergy (e.g., FDS for Synergy A, EDC for Synergy B) as a thin, gray, dashed line. This visually confirms the tight correlation between the synergy profile and the muscle’s activity at baseline.

      (i) In page 10, the authors report as maladaptive behaviour the duration of the aberrant reaching component from day 29 (monkey A) and day 20 (monkey B). What was happening before those recording dates? Were the monkeys recovering?

      Thank you for this question. We have added two sentences to the start of the Results section (“Functional Recovery Follows...”) clarifying that the period between surgery and formal recordings included approximately one week of home cage recovery followed by several weeks of assisted task practice. Formal recordings began once the monkeys could perform the task consistently without assistance.

      (j) In the Methods (EMG Analysis), the authors state that they resumed their recordings post-TT “once they (the monkeys) were able to perform the task on their own”. It would be good if the authors made this more precise (e.g., based on success rate or another metric).

      We thank the reviewer for this suggestion to increase precision. We have revised the Methods section to include the specific criteria used for resuming post-surgical recordings. Recordings were restarted once the monkeys were able to perform the task independently (i.e., without assistance from the experimenter) and consistently achieved a successful trial count of at least 100 trials within a single experimental session.

      (k) Line 266- reads “Alternation of EMG activity in non-transferred muscle suggests one possibility: TT might alter the control strategy of coordinated muscle activity for hand movement by modifying the transferred muscles and their agonists as a cohesive unit”, however, some “muscles showed patterns that were incompatible with a simple swap” (Lines 255-256). Doesn’t this observation suggest that what happens is not a simple change in muscle synergies?

      We thank the reviewer for this insightful question regarding the interpretation of muscles with adaptive patterns incompatible with the primary ‘swap-and-revert’. We agree that these observations require careful consideration within the modular framework. Our interpretation is that these muscles do not represent evidence against modular control, but rather reflect the involvement of multiple modules adapting concurrently. Specifically, muscles like FCR and PL, which showed distinct patterns, are primary members of Synergy C (the wrist flexor synergy) in Monkey A. Their adaptive profile is therefore consistent with the task-specific recruitment and retiming of Synergy C as part of the compensatory tenodesis strategy, rather than being a deviation from the swap observed in Synergies A and B. Synergies represent the dominant, shared variance in muscle activity. While they capture the overall strategy, some degree of individual muscle variation or the influence of secondary synergies is expected. We have added a sentence to the Results section to clarify that these diverse patterns likely reflect the differential involvement of muscles in multiple adapting synergies. We believe the overall evidence still strongly supports the modulation of stable synergies as the primary mechanism of adaptation in this paradigm.

      (l) You may want to call synergy A and synergy B, synergy F and synergy E to make recall easier? (Same for synergy C and D, which could be F2 and E2).

      We thank the reviewer for this helpful suggestion aimed at improving clarity. We considered renaming the synergies based on function (e.g., F/E). However, given the number of figures and the complexity of a global change, and the fact that the functional roles of Synergies C and D differed between animals, we decided to retain the original A/B/C/D labels for consistency. To ensure clarity for the reader, we have carefully checked the manuscript to ensure that we consistently define the primary functional role of each synergy (e.g., "Synergy A, the primary finger flexor synergy") when it is discussed.

      (m) Lines 315-317 - “These patterns of changes in synergy 3 and 4, both contributed minimally to the EMG of transferred muscles” -> This statement puts the causality as synergies cause muscles to activate according to certain patterns, which is supported by work by several groups -including the authors- however, they could also reflect biomechanical and task constraints as others have argued; perhaps this tone would be better for the discussion?

      We thank the reviewer for this nuanced point regarding the interpretation of synergy contributions. We agree that the causal relationship between computed synergies and muscle activity is complex and can reflect both neural commands and task constraints. To address this, we have revised the sentence in question in the Results section. Instead of stating that the synergies "contributed minimally," we now state that the changes in these synergies "were associated with minimal EMG activity in the transferred muscles." This phrasing is more descriptive of the observation and less implicitly causal, while retaining the key point within the flow of the results. The subsequent sentences, which offer interpretation, are already framed speculatively ("This suggests...", "may have served...").

      (n) Line 403 How do the authors conclude from the synergy patterns in Figure 11 that the early post-TT is characterised by “an unstable and inefficient neural control strategy”? To me, this is shown clearly in the behaviour, not in these plots, unless I’m missing something?

      We thank the reviewer for this comment, which highlights the need to clearly connect our neural findings to the behavioral outcome. The reviewer is absolutely correct that the behavioral data (Fig. 5) provides the most direct evidence of instability and inefficiency during the early adaptation phase. Our intention was to argue that the neural patterns observed in Figure 11 provide a physiological correlate for this behavioral inefficiency. Specifically, the escalating aggregated EMG activity observed in the conflicted extensor synergy (Synergy B), which we term the ‘arms race’, represents significant muscle co-activation. Such co-activation is widely understood to be energetically costly and reflects a suboptimal control strategy where the CNS is essentially "fighting itself" against the altered mechanics. To make this link clearer, we have revised the concluding sentence of the relevant paragraph in the Discussion ("The early adaptation phase...") to explicitly state that this escalating co-activation is a known marker of inefficient recruitment and that it occurred concurrently with the period of poor behavioral performance shown in Figure 5.

      (o) Lines 469-471. The authors suggest that muscle synergies may be preserved post-TT because a modular approach (to motor control) may be computationally easy and metabolically cheap. To me, recent data suggest that the most parsimonious explanation is what they later say: that the nervous system may not be plastic enough to change this (e.g., see Makin and Krakauer, “Against reorganisation” also in eLife).

      We thank the reviewer for raising this important theoretical point and for referencing the relevant literature on constraints on cortical reorganization. We agree that the preservation of muscle synergies in the face of such a profound perturbation is a key finding that warrants careful interpretation. In our revised Discussion (section "The CNS Defaults to a Modular Strategy..."), we have now explicitly incorporated the perspective that synergy stability may reflect inherent constraints on neural plasticity, citing Makin and Krakauer (2023), alongside our original hypothesis regarding computational and metabolic efficiency. We present these ideas not as mutually exclusive, but as potentially complementary factors that both contribute to the CNS’s apparent preference for modulating existing modules rather than fundamentally restructuring them.

      (p) Lines 501-503. Also on interpretation. Would the metabolic cost indeed be much higher? Couldn’t the observed change in strategy be explained purely based on performance metrics?

      This is an important point. We agree that statements regarding high energy expenditure are interpretations, not direct measurements. We have carefully revised the manuscript (Abstract, Results, and Discussion) to soften these claims, using more speculative language (e.g., "likely costly," "what we propose was...") to clearly distinguish our interpretations from direct empirical findings.

      (q) Lines 538-. The authors link the initial adaptation phase to the fast process reported in adaptation studies and say that this leads to poor retention. However, it seems from their data that the behaviour is stable across (early) days, so doesn’t this rule out such an interpretation?

      We thank the reviewer for this insightful question regarding the interpretation of the early adaptive phase within the two-state model framework. The reviewer correctly notes that the early post-surgical behavior, while maladaptive, appeared relatively stable across days and did not show the rapid decay sometimes associated with the "poor retention" characteristic of the fast system. We agree that this apparent stability requires careful interpretation. In our revised Discussion (section "A Multi-Timescale Model..."), we now propose that the fast system is primarily responsible for the initial, rapid adoption of the ‘swap’ strategy in response to the large error signal. The subsequent persistence of this flawed but stable state for several weeks is likely not due to strong retention by the fast system itself, but rather reflects the time required for the parallel slow system to gradually develop a more effective compensatory strategy (i.e., the tenodesis grasp). Once this alternative strategy became viable, it enabled the abrupt "switchback," which we also attribute to the fast system recalibrating away from the highly costly swap strategy. Therefore, we believe our data is consistent with the involvement of a fast system driving rapid strategic shifts, even if the typical "poor retention" phenotype is masked by the lack of a viable alternative strategy during the early phase.

      Reviewer #2 (Recommendations for the authors):

      (1) The discussion would benefit greatly from a more careful comparison with prior work characterizing the response to experimental or clinical tendon or nerve transfer in different models.

      We thank the reviewer for suggesting these important references and for the recommendation to compare our findings more carefully with prior work. This is an excellent point, and we agree it will significantly strengthen the discussion. In our full revision, we have added a new paragraph to the Discussion section dedicated to this comparison. We discuss how our findings relate to classic work showing primate adaptive capacity beyond simple maladaptive responses (Sperry, 1947), EMG evidence for the persistence of original neural patterns alongside new ones in human patients (Illert et al., 1986), the critical role of altered peripheral biomechanics and myofascial force transmission in complicating adaptation (Maas & Huijing, 2012), and how our observation of synergy stability aligns with evidence for modular adaptation strategies (Berger et al., 2013). This comparison helps situate our unique findings of a multi-timescale process and synergy timing modulation within the broader context of motor relearning after musculoskeletal rearrangement.

      (2) Line 90 - Which disease or condition is studied in Hunter and Eckstein (2009)?

      Thank you. We have clarified this in the Introduction; the reference pertains to osteoarthritis.

      (3) Line 280 for clarity in text and as a reminder to the readers, please state which muscles are involved in each synergy grouping.

      We have updated the text (Results, 'Adaptation occurs through modulating...') to explicitly list the main contributing muscles for each synergy grouping (e.g., Synergy A: FDS and FCU for Monkey A). This provides the requested clarity regarding the functional identity of each synergy while maintaining readability. For the complete, quantitative muscle weight composition including minor contributors, we refer the reader to Figure 7 and Supplementary Table 1.

      (4) Line 180 There are differences in the time course for measurements between the behavioral metrics and EMGs. If not recorded at fixed time intervals, the differences in the time courses for the two monkeys should be explained.

      We thank the reviewer for this question regarding the time courses of our measurements. We interpret this comment in two ways, both of which we have addressed in the revised manuscript.

      First, if the reviewer is asking about the overall recording schedule, they are correct that sessions were not performed at fixed daily intervals, and the specific days sampled differed between monkeys. This non-uniform sampling was due to the practical constraints of long-term behavioral experiments (e.g., animal cooperation, scheduling, weekends) and the aim to capture data during key phases of adaptation. However, within any given session, behavioral (video) and EMG data were always collected concurrently.

      Second, if the reviewer is asking whether the set of days included differs between the behavioral plots (e.g., Fig 5) and the EMG/synergy plots (e.g., Figs 6, 9-11), this is a possibility depending on data quality criteria. Our criterion for including a session in the behavioral analysis was a minimum of 20 successful trials. However, for the more demanding synergy analysis, we required a higher minimum of 100 successful trials to ensure robust factorization. It is possible that a few sessions met the behavioral criterion but not the synergy criterion and were thus excluded from the latter analysis, leading to slight differences in the days presented across figures. To ensure full clarity, we have added text to the Methods section explicitly stating: (A) the rationale for the non-uniform daily sampling schedule, and (B) the specific minimum trial count criteria used for including data in the behavioral versus the synergy analyses, noting if this resulted in different sets of days being analyzed for different figures.
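      To make the consequence of the two thresholds concrete, the following sketch applies them to a made-up list of sessions. Only the thresholds (20 and 100 successful trials) come from the response above; the `Session` class, the helper `days_included`, and the example days and trial counts are hypothetical.

```python
from dataclasses import dataclass

MIN_TRIALS_BEHAVIOR = 20   # threshold stated for the behavioral analysis
MIN_TRIALS_SYNERGY = 100   # threshold stated for the synergy analysis

@dataclass(frozen=True)
class Session:
    day: int                 # post-surgery day (hypothetical values below)
    successful_trials: int

def days_included(sessions, min_trials):
    """Return the days whose sessions meet the minimum trial count."""
    return [s.day for s in sessions if s.successful_trials >= min_trials]

# Made-up sessions, for illustration only.
sessions = [Session(29, 45), Session(36, 120), Session(43, 15), Session(50, 210)]

behavior_days = days_included(sessions, MIN_TRIALS_BEHAVIOR)
synergy_days = days_included(sessions, MIN_TRIALS_SYNERGY)
```

      In this toy example, day 29 appears in the behavioral set but not in the synergy set, which is the kind of mismatch between figures described above.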

      (5) General figure comments - The figures are informative, but they could be better presented, designed, and formatted to explain the important results in the paper. The figures should be able to explain most of the key results without entirely referring to the text to find some of the details. I had a bit of trouble understanding Figure 9 & 10. I would also like to suggest that bringing raw data into some figures (e.g., EMG of different muscle groups), such as showing stability between the synergies, could improve the results and allow the story to flow with more clarity. Likewise, clearly showing the differences between baseline EMG measurements and post-surgery measurements could improve some of the result figures.

      We thank the reviewer for these important general comments on data presentation. We agree that the figures are the key to our story and are implementing several revisions based on this and other reviewer feedback to improve their clarity.

      General Presentation: We have conducted a thorough review of all figures to improve layout, consistency, and font legibility (addressing R3, 1 and the Reviewing Editor's comments). This includes adjusting the layouts of Figures 3, 4, and 6 for better alignment and clarity.

      Figures 9 & 10 (Cross-correlation): The reviewer mentioned having trouble understanding these figures. In our revision, we have substantially rewritten the captions for Figures 9 and 10 to be much more descriptive. We explicitly walk the reader through how to interpret the plots (e.g., "The ‘swap’ is evidenced by the drop in self-correlation... and a concurrent rise in antagonist-correlation...").

      Including "Raw Data" (EMG): We thank the reviewer for this suggestion to provide more intuitive examples of the neural changes. We realize we did not sufficiently highlight this in the main text, but this complete data is already available in the manuscript. Supplementary Figures S1 and S2 provide a comprehensive overview of the EMG patterns for all recorded muscles in Monkey A and Monkey B, respectively. These figures show the pre-surgery and post-surgery average profiles for all recording sessions as well as the average profiles from five different post-surgery landmark days, covering the entire adaptation period. These figures directly visualize the swap-and-revert pattern in the transferred muscles and their agonists (e.g., EDC, ED23), as well as the diverse and complex adaptations in other non-transferred muscles (e.g., FCR, PL), as requested. To make this clearer, we have added explicit cross-references to Supplementary Figures S1 and S2 within the main Results section to ensure readers are directed to this detailed data.

      Showing Differences (Pre vs. Post): To "clearly show the differences between baseline... and post-surgery measurements," we implemented the point-by-point statistical comparison of pre- vs. final-day synergy profiles (as suggested in R1, 2b). This has resulted in a new Supplementary Figure visually highlighting the precise periods in the task where the final profiles still differ significantly from baseline (Fig. S9).

      We believe these additions (new figures and improved captions) will make the results much clearer and more self-explanatory, as the reviewer suggested.

      (6) Figure 1 - A table with all the acronyms would help with identifying all the muscles and their respective synergies (supplemental), especially when describing the muscles in the Results or Discussion section.

      This is an excellent suggestion. We have created a comprehensive table (Supplementary Table 1) listing all muscle abbreviations, full names, primary functional groups, and assigned synergies for both monkeys. We have added a reference to this table in the Figure 1 caption and the Methods section.

      (7) Figure 2 - is this mainly from Monkey A? If so, it should be stated.

      We thank the reviewer for pointing out this omission. We have updated the caption for Figure 2 to clarify that the example data shown (ultrasound, trajectories, and quantitative plots) are from Monkey A.

      (8) Figure 3 & Figure 4 seems unbalanced because of the descriptive need to explain Monkey B’s tasks? The figure alignments could be better.

      We thank the reviewer for this comment on the visual presentation of Figures 3 and 4. The reviewer’s observation that the figures appeared ‘unbalanced’ was correct. This was a direct consequence of two issues: (1) the different tasks required slightly different schematics (the "descriptive need" the reviewer mentioned), and (2) the original Figure 4 contained an additional kinematic panel (formerly 4E) that was unique to Monkey B, which broke the parallel structure with Figure 3.

      To address this and significantly improve the alignment, we have now moved the unique kinematic panel (formerly 4E) to a new Supplementary Figure (Supplementary Figure S8). This change has allowed us to re-arrange the panels in Figures 3 and 4 so that they now follow the exact same order. We have also adjusted the layout to ensure that corresponding panels are of a consistent size. We agree that this creates a much better visual balance and makes the comparison between the two monkeys far more direct and clear, as the reviewer suggested.

      (9) Figure 5. It seems like the animals can still perform the task post-surgery, but with high variability. Maybe emphasize the differences in variability between baseline and post-surgery?

      We thank the reviewer for this suggestion to emphasize the changes in variability. We have now quantified this using the Coefficient of Variation (CV) for key behavioral metrics across different phases (Pre-surgery, Early, Mid, Late post-surgery). The results confirm the reviewer’s observation of high variability post-surgery, particularly in the early phase. For instance, Monkey A’s grip formation time CV spiked dramatically (Pre: 47% vs Early: 133%), while Monkey B’s remained high (Pre: 82% vs Early: 76%). Interestingly, while Monkey A’s variability returned close to baseline levels in the late phase (Late: 55%), Monkey B’s variability increased further (Late: 97%), suggesting persistent inconsistency despite functional recovery.

      We also observed metric-specific changes. Monkey A’s pull time became less variable than baseline later on (Pre: 65% vs Late: 43%), suggesting refinement of that action. Conversely, the CV of Monkey B’s grasp aperture remained consistently low throughout (Pre: 26% vs Late: 19%), indicating that relatively precise kinematic control was maintained or quickly regained. We have added a summary of these findings to the Results section to provide a more complete picture of how behavioral variability evolved relative to baseline during the adaptation process.
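      For reference, the Coefficient of Variation reported above is the standard deviation expressed as a percentage of the mean. The sketch below assumes the sample standard deviation (`ddof=1`); whether the sample or population form was used is not stated in the response, and the per-phase values are invented for illustration.

```python
import numpy as np

def coefficient_of_variation(values):
    """CV in percent: sample standard deviation divided by the mean."""
    values = np.asarray(values, dtype=float)
    if values.size < 2:
        raise ValueError("need at least two observations")
    mean = values.mean()
    if mean == 0:
        raise ValueError("CV is undefined for a zero mean")
    # ddof=1 is an assumption; the response does not specify the estimator.
    return float(values.std(ddof=1) / mean * 100.0)

# Invented grip formation times in seconds (not the recorded data).
phases = {
    "pre": [0.40, 0.50, 0.60],
    "early": [0.20, 0.50, 1.40],
}

cv_by_phase = {
    phase: coefficient_of_variation(v) for phase, v in phases.items()
}
```

      Because the CV is normalized by the mean, it allows variability to be compared across metrics and phases whose absolute values differ, which is why values above 100% are possible for skewed timing data.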

      (10) Figure 6 quite a confusing figure. This figure needs to be better presented. The figure legends are hard to see for Monkey A vs Monkey B. At first, I thought Monkey B’s figure legend also represented Monkey A. I would suggest reorganizing the figures for clarity and coherence.

      We agree that the original presentation of Figure 6 was dense and potentially confusing. We have completely reorganized the figure to improve clarity and coherence.

      (1) Clear Separation: The figure is now structured with a strict separation between Monkey A (Left Panels, A-J) and Monkey B (Right Panels, K-Q), with prominent headers for each subject to prevent ambiguity.

      (2) Improved Legends: We have redesigned the legends to be larger and placed them explicitly within their respective subject’s section to ensure it is immediately clear which data they describe.

      (3) Visual Consistency: We have standardized the color schemes and axis layouts across this and all other figures to reduce cognitive load and facilitate easier comparison between subjects.

      (11) Figure 12 - This figure is incomplete without Monkey A’s results. The videos in the supplemental sections seem clear enough for some kinematic analysis. The story could be more supported with more thorough measurements of the kinematics from both animals to show how they differ over time and by highlighting the two phases. As a minor note, it would be helpful to present the kinematic data together with a schematic of when during the task the data are drawn from, using the % task range scale, since that is the standard throughout the paper.

      We thank the reviewer for their suggestions regarding the kinematic analysis. We agree that a parallel kinematic analysis for Monkey A, similar to that in Figure 12, would be ideal. We did attempt this. Unfortunately, while the supplemental videos for Monkey A are sufficient for observing the overall movement trajectory, they are not suitable for the detailed joint angle analysis the reviewer suggests. The videos for Monkey A were recorded at a frame rate too low to reliably extract the rapidly changing joint angles of the wrist and fingers during the grasping movement. For this reason, the detailed kinematic analysis was limited to Monkey B, for which we had high-speed video recorded at 240 fps, allowing for a robust analysis of these fast movements.

      We have, however, expanded our kinematic analysis for Monkey B to show the refinement of the tenodesis strategy over the full time course (New Figure 13), which does help to highlight the different adaptive phases for that animal. We have also clarified in the manuscript (e.g., in the caption for Figure 12) that the lack of Monkey A data for this specific analysis was due to the low-resolution and low-frame-rate video available.

      We agree that defining the precise timing of the kinematic snapshot relative to our normalized task range is critical for accurate interpretation. In response, we have added a new panel (Figure 12C) that explicitly maps the kinematic snapshot to our standardized task timeline. This schematic clarifies that the joint angle analysis captures the hand configuration during the pre-shaping phase, specifically at 83 ms prior to object contact (which corresponds to -0.02% of the normalized task range). This ensures the kinematic data can be directly interpreted within the same temporal context as the EMG and synergy results presented throughout the paper.

      Reviewer #3 (Recommendations for the authors):

      First and most major: I found many of the figures much too small and incredibly difficult to read. Possibly the most difficult was Figure 7, where I had to zoom in a great deal to read what muscles corresponded to which bars. I don’t have specific suggestions here other than to make sure that figures are legible.

      We thank the reviewer for highlighting this important issue. We have comprehensively revised the figures to ensure they are legible at standard publication sizes. Specific improvements include:

      (1) Figure 7: We have significantly increased the font size of the x-axis muscle labels and optimized the bar chart spacing to ensure the muscle identities are readable without excessive zooming.

      (2) Global Updates: Across all figures, we have increased font sizes for axis labels and titles, removed unnecessary whitespace to maximize the data-to-ink ratio, and exported all final figures in high-resolution vector formats to ensure clarity.

      Second and more minor: I liked the setup of the manuscript, where the authors explained the unique benefits of their experimental methods and the question they were going after (“When confronted with structural changes to the musculoskeletal system, does the CNS adapt by modulating existing synergies, or by shifting toward more fractionated control strategies?”). However, the evolution of the paper made the answer to this question seem very confusing to me as I read it. The results show that monkeys initially modulated existing synergies in phase 1, but then reverted to the original modulation. This, in addition to the way the question was set up initially, made me think the conclusion was going to be that the synergies themselves changed in the second phase, but this paradoxically was not the case--synergies were stable throughout. I was left confused for the back half of the results section, until the discussion on tenodesis and developing compensatory movement strategies. So the answer is that the monkey learns by modulating existing synergies, but using different strategies in different learning phases. I’m not entirely sure how to avoid this confusion, but I wonder if there’s a way to foreshadow this finding earlier on.

      We thank the reviewer for this valuable feedback on the manuscript’s narrative structure. We understand how the initial framing (modulation vs. fractionation) followed by the reversion of the initial modulation could lead to confusion before the compensatory strategy is fully introduced. To address this, we have made two key adjustments in the revised manuscript:

      (1) In the Introduction, after posing the central question, we have added a sentence to subtly foreshadow that the adaptive process might be complex and multi-phasic, requiring analysis over extended timescales.

      (2) In the Results section, at the transition point between describing the reversion of the primary synergy timings and introducing the compensatory tenodesis strategy, we have added a short paragraph to explicitly signal that the reversion was not the complete solution and that a distinct compensatory strategy emerged concurrently.

      We believe these changes improve the narrative flow, provide better signposting for the reader, and mitigate the potential for confusion identified by the reviewer, making it clearer that the ultimate solution involved modulating existing synergies but via different strategies across distinct learning phases. We appreciate the reviewer’s help in identifying this area for improvement.

    1. eLife Assessment

      This useful study uses a chemoinformatics pipeline to identify a list of candidate mosquito repellents that may be pleasant to smell and safe for humans. The strength of evidence and in particular the computational methodology are incomplete because it is insufficiently benchmarked against other leading models. At the high concentrations tested, there may also be off-target effects of the repellents on the mosquitoes that are not considered.

    2. Reviewer #1 (Public review):

      Summary:

      In this manuscript, the authors set up a pipeline to predict insect repellents that are pleasant and safe for humans. This is done by daisy-chaining a new classification model based on predicting repellents with a published model on predicting human perception. Models use a feature-engineered selection of chemical features to make their predictions. The predicted molecules are then validated against a proxy humanoid (heated brick) and their safety is tested by molecular assays of human cells. The humanistic approach to modeling these authors have taken (which considers cosmetic/aesthetic appeal and safety) is novel and a necessary step for consumer usage. However, the importance of pleasantness over effectiveness is still up for debate (DEET is unpleasant but still used often) and the generalization of safety tests is unknown and assumed. The effectiveness of the prediction models is also still warranted. They pass the authors' own behavioral tests, but their contribution to the field is unknown as both models (new and published) have not been rigorously benchmarked against previous models. Moreover, the authors' breadth of literature in this field is sparse, ignoring directly related studies.

      Strengths:

      Humanistic approach to modeling considers pleasantness and safety. Chaining models can help limit the candidate odorants from the vastness of odor space.

      Weaknesses:

      The current models need to be benchmarked against leading models predicting similar outcomes. Similarly, many of these papers need to be addressed and discussed in the introduction. The authors might even consider their data sources for model training to increase performance and lexical categorization for interoperability. For instance, the Dravnieks data lexicon, currently used in the human perception lexicon, has been highly criticized for its overlapping and hard-to-interpret descriptive terms ("FRAGRANT", "AROMATIC").

      Human Perception:

      Khan, R. M., Luk, C. H., Flinker, A., Aggarwal, A., Lapid, H., Haddad, R., & Sobel, N. (2007). Predicting odor pleasantness from odorant structure: pleasantness as a reflection of the physical world. Journal of Neuroscience, 27(37), 10015-10023.

      Keller, A., Gerkin, R. C., Guan, Y., Dhurandhar, A., Turu, G., Szalai, B., ... & Meyer, P. (2017). Predicting human olfactory perception from chemical features of odor molecules. Science, 355(6327), 820-826.

      Gutiérrez, E. D., Dhurandhar, A., Keller, A., Meyer, P., & Cecchi, G. A. (2018). Predicting natural language descriptions of mono-molecular odorants. Nature communications, 9(1), 4979.

      Lee, B. K., Mayhew, E. J., Sanchez-Lengeling, B., Wei, J. N., Qian, W. W., Little, K. A., ... & Wiltschko, A. B. (2023). A principal odor map unifies diverse tasks in olfactory perception. Science, 381(6661), 999-1006.

      Related cleaned data: https://github.com/BioMachineLearning/openpom

      Insect Repellents:

      Wright, R. H. (1956). Physical basis of insect repellency. Nature, 178(4534), 638-638.

      Katritzky, A. R., Wang, Z., Slavov, S., Tsikolia, M., Dobchev, D., Akhmedov, N. G., ... & Linthicum, K. J. (2008). Synthesis and bioassay of improved mosquito repellents predicted from chemical structure. Proceedings of the National Academy of Sciences, 105(21), 7359-7364.

      Bernier, U. R., & Tsikolia, M. (2011). Development of Novel Repellents Using Structure− Activity Modeling of Compounds in the USDA Archival Database. In Recent Developments in Invertebrate Repellents (pp. 21-46). American Chemical Society.

      Wei, J. N., Vlot, M., Sanchez-Lengeling, B., Lee, B. K., Berning, L., Vos, M. W., ... & Dechering, K. J. (2022). A deep learning and digital archaeology approach for mosquito repellent discovery. bioRxiv, 2022-09.

      The current study assumes that insect repellents repel via their odor valence to the insect, but this is not accurate. Insect repellents also mask the body odor of humans, making them hard to locate. The authors need to consult the literature to understand the localization and landing mechanisms of insects to their hosts. Here, they will understand that heat alone is not the attractant as their behavioral assay would have you believe. I suggest the authors test other behavioral assays to show more convincing evidence of effectiveness. See the following studies:

      De Obaldia, M. E., Morita, T., Dedmon, L. C., Boehmler, D. J., Jiang, C. S., Zeledon, E. V., ... & Vosshall, L. B. (2022). Differential mosquito attraction to humans is associated with skin-derived carboxylic acid levels. Cell, 185(22), 4099-4116.

      McBride, C. S., Baier, F., Omondi, A. B., Spitzer, S. A., Lutomiah, J., Sang, R., ... & Vosshall, L. B. (2014). Evolution of mosquito preference for humans linked to an odorant receptor. Nature, 515(7526), 222-227.

      Wei, J. N., Vlot, M., Sanchez-Lengeling, B., Lee, B. K., Berning, L., Vos, M. W., ... & Dechering, K. J. (2022). A deep learning and digital archaeology approach for mosquito repellent discovery. bioRxiv, 2022-09.

      Comments on revisions:

      The revisions made to the manuscript do not fully address the concerns raised in the previous round of review. The authors are encouraged to consider the following points to strengthen the work.

      The benchmarking of the human perception models against Keller et al. (2017) and Gutiérrez et al. (2018) is insufficient, as the field has progressed considerably in the last five years with newer approaches using larger data sources. Benchmarking against more recent models would better situate the contribution of this work.

      The exclusion of human repellency data from preprint Boyle et al. (2016) is worth reconsidering. For a study that takes an explicitly human-centric modeling approach, human behavioral data on repellency, pleasantness, and usage intent would directly support the central claims of the manuscript.

      The key claims regarding repellency and consumer acceptability would be considerably strengthened by the addition of these data.

    3. Reviewer #2 (Public review):

      Summary:

      This is an interesting study that seeks to identify novel mosquito repellents that smell attractive to humans. This is the second time I have reviewed, and the authors have not done anything to address the weaknesses. Although the subject matter may provide important new information for the development of new repellents, its current breadth is limited without additional assays. Arm-in-cage assays, testing the longevity of the new repellents, other ML analyses and confusion matrices, would strengthen the manuscript and demonstrate innovation. The lack of cohesion and new experimental results weakens the manuscript.

      Strengths:

      The combination of standard machine learning methods with mosquito behavioral tests is a strength.

      Weaknesses:

      The study would be strengthened by describing how other modern ML approaches (RF, decision trees) would classify and identify other potential repellents.

      A comparison of the repellent activity between DEET and the top ten hits identified in this new study indicates little change in repellent activity (~3%), suggesting that DEET remains the gold standard. Without additional toxicity tests and longevity tests, the study is arguably incremental. The study's novelty should be better clarified.

      The Methods in the repellency tests are sparse, and more information would be useful. Testing the top repellents at low doses (<<1%) and for long periods (2-12 h) would strengthen the manuscript. Without this information, the manuscript is lacking in depth.

      Testing human subjects on their olfactory percept of the repellents would also increase the depth and utility of the manuscript. Without additional experiments, the authors' conclusions lack support and have limited impact on the state-of-the-art.

      This manuscript is a mix of different approaches, which makes it lack cohesion. There is the ML method for classifying new repellents that smell good, but no testing of the repellents on human volunteers. The repellents are not tested at realistic concentrations and durations. And the calcium mobilization test is strange, and makes little sense in the context of the other experiments and framing of the manuscript.

      Comments on revisions:

      The authors have a potentially strong manuscript. However, I would urge the authors to address the reviewer comments in a substantive manner.

    4. Author response:

      The following is the authors’ response to the original reviews.

      Mosquito-transmitted diseases cause nearly a million deaths every year and significant worldwide morbidity. Moreover, the geographical range of mosquito vectors is rapidly expanding due to climate change and mosquito-borne disease risks are emerging in new parts of the world.

      Innovation in finding new repellents has been slow due to limitations in current research approaches and high costs for EPA registration (especially for synthetic compounds). Since DEET was discovered in the 1940s, only a handful of additional actives have been approved by the EPA for repellent products. In the 20+ years since the discovery of insect odorant receptors from genomes, not a single novel repellent compound has been identified that was registered by the EPA. Thus, there is both a strong need for new approaches to find insect repellents and a need for new active ingredients that are safe and strategically effective.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      In this manuscript, the authors set up a pipeline to predict insect repellents that are pleasant and safe for humans. This is done by daisy-chaining a new classification model based on predicting repellents with a published model on predicting human perception. Models use a feature-engineered selection of chemical features to make their predictions. The predicted molecules are then validated against a proxy humanoid (heated brick) and its safety is tested by molecular assays of human cells. The humanistic approach to modeling these authors have taken (which considers cosmetic/aesthetic appeal and safety) is novel and a necessary step for consumer usage. However, the importance of pleasantness over effectiveness is still up for debate (DEET is unpleasant but still used often) and the generalization of safety tests is unknown and assumed. The effectiveness of the prediction models is also still warranted. They pass the authors' own behavioral tests, but their contribution to the field is unknown as both models (new and published) have not been rigorously benchmarked to previous models. Moreover, the author's breadth of literature in this field is sparse, ignoring directly related studies.

      Strengths:

      Humanistic approach to modeling considers pleasantness and safety. Chaining models can help limit the candidate odorants from the vastness of odor space.

      Weaknesses:

      The current models need to be benchmarked against leading models predicting similar outcomes. Similarly, many of these papers need to be addressed and discussed in the introduction. The authors might even consider their data sources for model training to increase performance and lexical categorization for interoperability. For instance, the Dravnieks data lexicon, currently used in the human perception lexicon, has been highly criticized for its overlapping and hard-to-interpret descriptive terms ("FRAGRANT", "AROMATIC").

      Human Perception:

      Khan, R. M., Luk, C. H., Flinker, A., Aggarwal, A., Lapid, H., Haddad, R., & Sobel, N. (2007). Predicting odor pleasantness from odorant structure: pleasantness as a reflection of the physical world. Journal of Neuroscience, 27(37), 10015-10023.

      Keller, A., Gerkin, R. C., Guan, Y., Dhurandhar, A., Turu, G., Szalai, B., ... & Meyer, P. (2017). Predicting human olfactory perception from chemical features of odor molecules. Science, 355(6327), 820-826.

      Gutiérrez, E. D., Dhurandhar, A., Keller, A., Meyer, P., & Cecchi, G. A. (2018). Predicting natural language descriptions of mono-molecular odorants. Nature communications, 9(1), 4979.

      Lee, B. K., Mayhew, E. J., Sanchez-Lengeling, B., Wei, J. N., Qian, W. W., Little, K. A., ... & Wiltschko, A. B. (2023). A principal odor map unifies diverse tasks in olfactory perception. Science, 381(6661), 999-1006.

      The human perception predictions were performed using models that we had reported in two earlier publications, which we have now indicated clearly in the results and methods sections of the VOR: Kowalewski & Ray, iScience (2020b) and Kowalewski, Huynh & Ray, Chem. Senses (2021). Three of the four references pointed out by the referee were cited in these prior studies, which involved computational validation by predicting on a test set of the data that was left out of training (as is typically done), and also predicting across different human studies with a high degree of success. A rigorous benchmarking of the odor perception models was done in Kowalewski, Huynh & Ray, Chem. Senses (2021) and in a mini-review published in the same issue of the journal by Gerkin, Chem. Senses (2021). This included a favorable comparison with the two references indicated by the referee: Keller et al., Science (2017) and Gutiérrez et al., Nat. Commun. (2018).

      The 4th reference, Lee et al., Science (2023), describes a neural network approach and was published well after our mosquito behavior studies were completed. Although they used an advanced neural network model, Lee et al. worked with 2-D structures of compounds, in contrast to our 3-D approach. They also did not report cross-study validations, comparisons with Keller et al. (2017), or benchmarks against past studies, so it is difficult to assess advances, if any. We have added this reference in the VOR.

      The intent of the current study was to move beyond testing approaches, of which there are many, and instead work on a practical use case. As we see it, it is not necessarily the prediction of fragrance character or quality alone that matters but overlap with other predicted bioactivities. From the perspective of human use, a molecule with a pleasing scent that also repels insects is likely to be far more useful than one with an unappealing scent. Accordingly, our task in this study was to select molecules that fit into specific use categories: display strong insect repellency, have pleasing scent profiles, are natural in origin and are potentially repurposed from flavors and fragrances.
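
      The selection logic described above, chaining predicted bioactivities and keeping only the overlap, can be illustrated with a minimal sketch. Everything in it is hypothetical: the `Candidate` fields, the score scales, the 0.5 thresholds, and the compound names are placeholders for illustration and are not taken from the study.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    repellency_score: float    # hypothetical output of a repellency classifier
    pleasantness_score: float  # hypothetical output of a human-perception model
    natural_origin: bool       # hypothetical flag, e.g. from a flavor/fragrance list


def shortlist(candidates, min_repellency=0.5, min_pleasantness=0.5):
    """Keep candidates that pass every stage, ranked by predicted repellency."""
    kept = [c for c in candidates if c.repellency_score >= min_repellency]
    kept = [c for c in kept if c.pleasantness_score >= min_pleasantness]
    kept = [c for c in kept if c.natural_origin]
    return sorted(kept, key=lambda c: c.repellency_score, reverse=True)


if __name__ == "__main__":
    pool = [
        Candidate("compound_A", 0.90, 0.8, True),
        Candidate("compound_B", 0.95, 0.2, True),   # predicted repellent, unpleasant
        Candidate("compound_C", 0.70, 0.9, True),
        Candidate("compound_D", 0.80, 0.9, False),  # not natural in origin
        Candidate("compound_E", 0.30, 0.9, True),   # weak predicted repellency
    ]
    print([c.name for c in shortlist(pool)])
```

      Applying the filters in sequence means each later model only needs to score the survivors of the earlier stage, which is one way to narrow a very large odor space.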

      Insect Repellents:

      Wright, R. H. (1956). Physical basis of insect repellency. Nature, 178(4534), 638-638.

      Katritzky, A. R., Wang, Z., Slavov, S., Tsikolia, M., Dobchev, D., Akhmedov, N. G., ... & Linthicum, K. J. (2008). Synthesis and bioassay of improved mosquito repellents predicted from chemical structure. Proceedings of the National Academy of Sciences, 105(21), 7359-7364.

      Bernier, U. R., & Tsikolia, M. (2011). Development of Novel Repellents Using Structure− Activity Modeling of Compounds in the USDA Archival Database. In Recent Developments in Invertebrate Repellents (pp. 21-46). American Chemical Society.

      The Katritzky et al. PNAS (2008) paper is cited in our study, and we have indicated that the chemical analogs reported therein are part of the training data set in our study. We thank the reviewer for pointing us to the book chapter by Bernier & Tsikolia (2011), which reviews the QSAR approaches taken for repellent discovery and in large measure focuses on the Katritzky et al. PNAS (2008) paper. We did cite two relevant studies by Uli Bernier.

      The current study assumes that insect repellents repel via their odor valence to the insect, but this is not accurate. Insect repellents also mask the body odor of humans making them hard to locate. The authors need to consult the literature to understand the localization and landing mechanisms of insects to their hosts. Here, they will understand that heat alone is not the attractant as their behavioral assay would have you believe. I suggest the authors test other behaviour assays to show more convincing evidence of effectiveness. See the following studies:

      De Obaldia, M. E., Morita, T., Dedmon, L. C., Boehmler, D. J., Jiang, C. S., Zeledon, E. V., ... & Vosshall, L. B. (2022). Differential mosquito attraction to humans is associated with skin-derived carboxylic acid levels. Cell, 185(22), 4099-4116.

      McBride, C. S., Baier, F., Omondi, A. B., Spitzer, S. A., Lutomiah, J., Sang, R., ... & Vosshall, L. B. (2014). Evolution of mosquito preference for humans linked to an odorant receptor. Nature, 515(7526), 222-227.

      Wei, J. N., Vlot, M., Sanchez-Lengeling, B., Lee, B. K., Berning, L., Vos, M. W., ... & Dechering, K. J. (2022). A deep learning and digital archaeology approach for mosquito repellent discovery. bioRxiv, 2022-09.

      In this study we took an unbiased approach to compile the training data set, including several known insect repellents of varying chemical structures and volatility, for most of which there is no information on how they are sensed by insects. Not surprisingly, the repellents we identified are varied in structure and in functional groups, and are likely detected in more than one way by the mosquitoes, using olfactory and/or gustatory systems. We did not consider “masking” of skin attraction as a factor in the training data set in this study, which precluded the need to discuss the papers pointed out by the referee. In fact, there is an extremely vast and rich body of literature regarding human skin odor, CO₂ and breath emanations, which includes our own contributions of research, and review articles that are not discussed in the current paper.

      We did in fact conduct human arm-in-cage experiments with a few of the compounds reported in this study using female Aedes aegypti mosquitoes; a preprint describes the smaller-scale analysis, the results of which show very strong repellency, in Boyle et al. bioRxiv (2016) https://doi.org/10.1101/060178 (Figure 4). That line of experimentation falls outside the scope of the current study and is being pursued separately. We have added the citation for this preprint in the results section of the VOR.

      However, heat with CO₂ as used in this study offers a practical proxy for evaluating prospective repellents in a high-throughput manner. It would certainly be desirable to further evaluate additional candidates from the heat attraction assay with human subjects in the future.

      We thank the reviewer for pointing out the preprint by Wei et al. bioRxiv (2022). Our approaches differ in that Wei et al. do not consider properties such as fragrance and toxicity. We also cannot assume that their newer neural network model is superior because, although the model uses a large training dataset, it does not use 3D chemical structures, which are extremely relevant for biological activity. While very little information is available for the actives reported in Wei et al., we independently evaluated their top compounds similar to or better than DEET (CAS# 3731-16-6, 4282-32-0, 2040-04-2, 32940-15-1 and 3446-90-0) and could not find information about toxicity, smell, or natural source. In contrast, the top repellents that we identify here as similar to or better than DEET (N=8) are all classified as GRAS (Generally Recognized as Safe) compounds by the Flavor and Extract Manufacturers Association (FEMA), are all naturally occurring (plum, jasmine, mushroom, grapes, etc.), and have pleasant smells. The dermal toxicity values in rabbits are known for six of our compounds and are at the best possible levels (≥5000 mg/kg).

      Reviewer #2 (Public Review):

      Summary:

      This is an interesting study that seeks to identify novel mosquito repellents that smell attractive to humans.

      Strengths:

      The combination of standard machine learning methods with mosquito behavioral tests is a strength.

      Weaknesses:

      The study would be strengthened by describing how other modern ML approaches (RF, decision trees) would classify and identify other potential repellents.

      The current approach already shows a success rate >85% for repellency coefficient >0.5 and identifies eight naturally occurring GRAS compounds with repellency as strong as or greater than DEET. This substantially expands the repertoire of strong natural repellents. Since the 1950s only six active ingredients have been registered by US EPA for use in topical repellents, of which only two are natural in origin (Oil of lemon eucalyptus and catmint oil) and they typically do not protect as well as DEET does. That being said, we have since explored other predictive algorithms, for instance Neural Networks. The experimental evaluation of these newer pipelines will take significant resources and time and will be the focus of future grants.
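
      One plausible reading of the ">85% success rate" figure above is the fraction of experimentally tested predictions whose measured repellency coefficient exceeds 0.5. The sketch below computes that fraction under this assumption; the function name and the example values are illustrative and are not data from the study.

```python
def success_rate(coefficients, threshold=0.5):
    """Fraction of tested compounds whose repellency coefficient exceeds threshold."""
    if not coefficients:
        raise ValueError("need at least one tested compound")
    hits = sum(1 for r in coefficients if r > threshold)
    return hits / len(coefficients)


if __name__ == "__main__":
    # Made-up coefficients for four hypothetical tested compounds.
    measured = [0.9, 0.7, 0.4, 0.6]
    print(f"success rate: {success_rate(measured):.0%}")
```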

      A comparison in the repellent activity between DEET and the top ten hits identified in this new study indicates little change in repellent activity (~3%), suggesting that DEET remains the gold standard. Without additional toxicity tests, the study is arguably incremental. The study's novelty should be better clarified.

      There is an urgent need to find new insect repellents that have better chances of being adopted by people who avoid DEET, such as in Africa and Asia. Having more natural actives that are effective expands the tools against disease-transmitting mosquitoes. As mentioned above, the top repellents that we identified as similar to or better than DEET (N=8) are all classified as GRAS (Generally Recognized as Safe) compounds by the Flavor and Extract Manufacturers Association (FEMA), are all naturally occurring (plum, jasmine, mushroom, grapes), and have pleasant smells. The dermal toxicity values in rabbits are known for six and they are at the best possible levels (≥5000 mg/kg).

      The Methods in the repellency tests are sparse, and more information would be useful. Testing the top repellents at low doses (<<1%) and for long periods (2-12 h) would strengthen the manuscript. Without this information, the manuscript is lacking in depth.

      The US Environmental Protection Agency (EPA) regulates mosquito repellents, and DEET-based commercial products are typically assigned protection times that vary with concentration (10%: ~2 hrs; 30%: ~5 hrs; 100%: ~8 hrs). These would be the relevant concentrations for testing protection times on human volunteers, not lower as suggested. Such studies fall within the realm of EPA registration efforts, involving extensive GLP-testing for safety, physical chemistry, and Human Subjects Board approvals. This is outside the scope of the current study and is typically accomplished during development efforts.

      Testing human subjects on their olfactory perceptions of the repellents would also increase the depth and utility of the manuscript. Without additional experiments, the authors' conclusions lack support and have limited impact on the state-of-the-art.

      This manuscript is a mix of different approaches, which makes it lack cohesion. There is the ML method for classifying new repellents that smell good, but no testing of the repellents on human volunteers. The repellents are not tested at realistic concentrations and durations. And the calcium mobilization test is strange and makes little sense in the context of the other experiments and framing of the manuscript.

      The human olfaction validation that we present in this paper is consistent with most current publications in the field (for example, Keller et al, Gutiérrez et al.). More systematic validation of the human odor character prediction pipelines used was presented in two previous papers Kowalewski & Ray, iScience (2020b) and Kowalewski, Huynh & Ray, Chem. Senses (2021) and a mini-review published in the same issue of the journal by Gerkin, Chem. Senses, (2021).

      Reviewer #3 (Public Review):

      While I am not a specialist in this field, I do have some knowledge of the subject matter and the computational aspects involved. The authors employ simple machine learning techniques (such as SVM) for the following purposes:

      (a) Prediction of aversive valence.

      (b) Predicting anti-repellent chemicals.

      (c) Predicting calcium mobilization.

      The approach is commonplace in chemoinformatics literature.

      Weaknesses:

      All the above models are presented discretely, making it difficult to discern experiment design principles and connectedness.

      The ML work is rudimentary, lacking adequate details. Chemoinformatics has reached great heights, and SVM does not seem contemporary.

      There is significant existing research on finding repellents.

      In the current study, we aimed to showcase how computational research may be combined with basic science to create scalable pipelines that address real-world problems, rather than to demonstrate methodological novelty of chemoinformatics approaches. Specifically, we wanted to use different predictive models to identify compounds that display strong insect repellency, have pleasing scent profiles, are natural in origin and are potentially repurposed from flavors and fragrances. Unfortunately, there is very little existing research on insect repellents that have these types of properties, which would make them better candidates for EPA registration. Most tested compounds are synthetic, are often analogs of known repellents like DEET, and necessitate substantial time and resources to register. Moreover, the identities of chemosensory receptors that are responsible for repellency to DEET and other compounds, and that are conserved across Anopheles, Aedes and Culex mosquitoes, are not known.

      It is true that the field of cheminformatics has experimented with a variety of newer approaches, based in part on neural networks (e.g., Graph Neural Networks and graph embeddings to encode chemical structure rather than a more conventional Extended Connectivity Fingerprint (ECFP)). Importantly, however, novelty does not imply usefulness. The mosquito behavior experiments that we present show a very high success rate (>85%), validating our approach and identifying several excellent candidates already.

      Strengths:

      Authors attempt to make a case for calcium mobilization in the context of repellency. This aspect sounds interesting but is not surprising.

      Behavioral profiling of repellents could be useful.

      We thank the referee for this comment. We have indeed done behavioral profiling for several repellents that evoke calcium mobilization, but we do not see any clear correlation thus far.

    1. eLife Assessment

      This manuscript proposes a valuable idea on how cortical networks may learn a helpful representation of sensory stimuli. The model implementing this idea is tested in multiple experimental paradigms. However, the evidence remains incomplete as to whether the method supports both invariance and equivariance and whether it can estimate the dynamics of the moving object.

    2. Reviewer #1 (Public review):

      Summary:

      The paper describes a biologically plausible version of JEPA using recurrent neural networks called RPL for recurrent predictive learning. Given an embedding z_t, a recurrent neural network processes these inputs with the form: c_{t+1} = RNN(c_t, z_t). Then the predictive network f is predicting the future inputs with the format: min || f(c_t) - stop_grad(z_{t+Δt}) ||^2. I understand that a prediction error is defined as: e = z_{t+Δt} - f(c_t) to model cortical measurements in the oddball task.
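
      The objective summarized in this paragraph can be made concrete with a toy scalar sketch. It is only an illustration under stated assumptions: a tanh integrator and a linear predictor with hand-picked weights (the names `rnn_step`, `predictor`, `rpl_loss` and all weight values are hypothetical, not taken from the manuscript). The indexing follows the summary as written, so f(c_t) is evaluated before z_t is integrated. Plain Python has no automatic differentiation, so the stop-gradient is indicated only by a comment: the target is treated as a constant.

```python
import math


def rnn_step(c, z, w_c=0.5, w_z=1.0):
    """Toy scalar integrator: c_{t+1} = tanh(w_c * c_t + w_z * z_t)."""
    return math.tanh(w_c * c + w_z * z)


def predictor(c, w_f=1.2):
    """Toy linear predictor f(c_t)."""
    return w_f * c


def rpl_loss(z_seq, delta=1):
    """Mean squared prediction error over a sequence of scalar embeddings.

    Returns (loss, errors) where errors[t] = z_{t+delta} - f(c_t).
    """
    c = 0.0  # c_0, assumed initial integrator state
    errors = []
    for t in range(len(z_seq) - delta):
        target = z_seq[t + delta]  # would be wrapped in stop-gradient during training
        errors.append(target - predictor(c))  # e = z_{t+delta} - f(c_t)
        c = rnn_step(c, z_seq[t])  # advance to c_{t+1}
    loss = sum(e * e for e in errors) / len(errors)
    return loss, errors


if __name__ == "__main__":
    loss, errors = rpl_loss([0.2, 0.5, 0.9, 0.4], delta=1)
    print(loss, errors)
```

      In a trainable version, the weights would be updated to reduce `loss`, with gradients blocked from flowing into the target.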

      The RPL model is also shown to build an internal world model, with "real-world" data like the movement of moving animals or speech signals. The representation is then compared to V1 data and expected prediction error signals in an oddball setting. In a stacked hierarchy of RNN learning with RPL, the higher layers appear to learn high-level latent variables, although gradients are not propagated downward to the lower layers.

      Strengths:

      (1) The paper tackles an open question: Self-supervised learning is thought to be a fundamental principle to explain how computation is structured in the brain. Cortical data suggest qualitatively that prediction error is a core principle of representation learning in the brain, but the field is still looking for a simple yet expressive model that would explain how the cortex learns its representations. RPL contributes in that direction by making a useful link between cortical representation learning in RNN models and the JEPA learning algorithm that was demonstrated to scale to large world model learning from video data by Lecun's group. It is very useful to connect this popular deep learning algorithm to cortical data.

      (2) The model formalism is relatively elegant and simple: Simple next input prediction objectives are conceptually simple but not necessarily trivial to build at scale. There is a clear benefit in comparison with contrastive or IL methods because they are free from dataset-specific data augmentation and negative samples, thereby moving the comp neuro field towards conceptually simpler models of representation in the cortex. Yet predictive-only models (and in particular predictive models in latent space instead of pixel space) are not easy to build in a stable fashion. The JEPA family is basically intended to solve this question; it is very nice and timely to bring this to comp neuro.

      (3) The methodology combining comp neuro and deep learning makes sense: The conceptual and qualitative analogy with cortical prediction errors is relevant and consistent with what is expected as a model of self-supervised learning in cortical models. The methodology to compare RPL with IL and CL is methodologically meaningful and grounded: showing, for instance, how some of the models fail to represent some latent structure in some toy datasets is interesting.

      (4) h-RPL: The h-RPL is perhaps the most creative departure from the JEPA model family. It would be interesting to say more about what was particularly difficult to see in the latent variables emerging in the hierarchical model. I often find it magical that layer-wise learning rules of this type are not learning redundant representations. Any insights why this is not the case here would be potentially insightful.

      Weaknesses:

      In general, I fully support the type of question and ideas that the paper is putting forward. It is, however, very hard in this research field to gain insight into specific conceptual contributions or specific bits of experimental data that the model puts forward. In pointing to the following weaknesses, I am encouraging the authors to lay out more clearly what the unique hypothesis is or the contribution of the RPL model that we should remember it for.

      (1) The devil is in the details:

      1a) Comparison with JEPA variants: JEPA variants are integrating different details into the learning algorithm. Integrating, for instance, "masking" of the latent encoder targets, or EMA in the style of BYOL or Siamese networks, for the predicted representations. It is great that RPL does not seem to need any of those (next input prediction is a natural implementation of masking, and EMA does not seem to be used). It is notoriously hard for the JEPA model to work without these features. Since some of these details are sometimes surprisingly crucial for a simulation to work, it would be good to report which of the other important details were key to live without EMA and masking. Is it the difference in learning rate, for instance? Or maybe the tasks considered are simply easy enough for any model to work; if so, it could be useful to acknowledge to what extent this is true.

      1b) Comparison with IL and CL: On a high level, the comparison with IL and CL algorithms is written as conclusive. I suspect that the failure modes of IL and CL that are described are not due to the algorithms themselves, but rather to the construction of invariance statistics or the choice of negative sample sets (the sets of samples among which variance 1 is requested by VICreg). For instance, if variance (or negative sample set) is taken only across time, the variance object identity is expected to collapse. Similarly, if the variance is taken across the object identity, the variance across time can collapse. So I wonder if the failure of IL and CL is induced by the construction of the variance definition.

      (2) Prediction error: When compared to the recording of cortical activity in Figure 7, it is not obvious from the figure which latent space we are talking about mathematically. Is it the vector z, c or the prediction error e? This is rather important from a neuroscientific point of view, because the prediction error e is expected to explain the neuronal data. On the other hand, the prediction error e is only used in the learning algorithm to define the loss function, but it is not the communication medium between the RNN units c (or with the encoder z).

      In the brain, since the measurements are recorded as neural activity, they are communication channels between specific units (z or c). It is probably c or z that would already explain the oddball prediction error. I believe that other models, like Forward-forward of Nejad et al., have tried quite hard to address this apparent tension. Whether or not this is resolved by RPL, I think it would be beneficial to state the problem and clarify how the algorithm addresses or ignores the issue.

      (3) Successor representation without value? I believe the term successor representation is historically relevant in a reinforcement learning (RL) setting and has a precise mathematical definition. Without RL, I feel that learning successor representation is conceptually identical to learning a transition matrix (aka, a primitive world model). I therefore wonder if the pitch for high-level framing of the successor representation is appropriately described or trivial.

      (4) Learning in RNN: Learning with recurrent networks appears to be a key in this model presented here (it is in the algorithm name). Yet, this aspect of the model and the literature on biologically plausible learning rules for RNN is not really discussed.

    3. Reviewer #2 (Public review):

      This is a very interesting manuscript, which proposes a novel idea on how cortical networks may learn useful representations of sensory stimuli. The model implementing this idea is thoroughly tested in multiple experimental paradigms. The manuscript is very clearly written. I feel it may have a significant impact on our understanding of cortical circuitry.

    4. Reviewer #3 (Public review):

      Summary:

      This paper presents Recurrent Predictive Learning (RPL), a self-supervised model conceptually similar to Joint-Embedding Predictive Architecture (JEPA) models. RPL sequentially observes dynamic scenes to predict subsequent observations. A central claim of the work is that the model's trained representations are simultaneously invariant and equivariant to transformations, such as movement properties that emerge without explicit supervision. These representational qualities are demonstrated through three experiments utilizing two simulated datasets and one naturalistic dataset. Furthermore, the latent embeddings are qualitatively compared with neural data, showing that the model reproduces the successor representation observed in human V1 and the local/global oddball effect in the monkey Prefrontal Cortex.

      Strengths:

      (1) The paper addresses a fundamental question relevant to both computational neuroscience and machine vision: how the brain learns representations that are simultaneously invariant and equivariant to transformations. The manuscript is well-written, easy to follow, and supported by clear visualizations.

      (2) While JEPA-style models have recently gained significant traction in the artificial intelligence community, this paper nicely bridges the gap to neuroscience. By framing these architectures as a theory for visual learning in the brain, the authors provide valuable insights into how predictive frameworks can explain cortical processing.

      (3) The qualitative alignment with V1 and PFC data is a particularly strong contribution, as it offers a potential mechanistic explanation for observed neural phenomena through the lens of self-supervised learning.

      Weaknesses:

      (1) The central claim, that both invariance and equivariance emerge spontaneously, requires further scrutiny (see Ghaemi et al., NeurIPS, 2025; Garrido et al., arXiv, 2024). In particular, the synthetic "moving animal" dataset used in this paper may be too simple to fully support this claim. In latent space prediction, a model must predict both the scene content and the dynamics of movement. Because movement (whether ego-motion or external) is often highly uncertain (or multi-modal), predictive models in naturalistic settings often "collapse" toward learning purely invariant representations, ignoring the hard-to-predict dynamics. In the provided simulations, the movements are extremely predictable. In more complex scenarios, the model would likely prioritize content (invariance) over dynamics (equivariance) unless aided by action-conditioning or explicit factor estimation (Zhang et al., ICLR, 2026). The authors' results in Figure 5 using naturalistic video seem to reflect this limitation, given the lower performance on the naturalistic videos compared to the synthetic datasets.

      (2) The framing of the RPL model as an entirely new theory of representation learning is slightly overstated. The focus on prediction in representation space rather than input space is the defining characteristic of JEPA and various other Self-Supervised Learning (SSL) models, even sequential prediction. While this paper clarifies the connection between these AI frameworks and cortical circuits, the work would be strengthened by more explicitly positioning RPL within the context of existing JEPA-style models and prior SSL theories of the visual system.

      (3) A significant challenge in latent-space SSL is avoiding "representational collapse" (where the model provides a trivial constant output). While the paper alludes to JEPA-like solutions, it lacks a detailed explanation (in both the text and the architectural schematics) of the specific technique used to prevent collapse. Consequently, it is difficult to evaluate the authors' claim of "biological plausibility," as the biological equivalents of common machine learning techniques (such as stop gradient) are not discussed.

      (4) Recent work has shown that the capacity (size) of the predictor significantly influences the learned representations in a JEPA-type world model (Garrido et al., 2024). In simpler scenarios, a large enough predictor can allow a model to "memorize" dynamics rather than learning generalized equivariant features. It would be beneficial to see how the ratio of predictor size to encoder size affects the emergence of these features.

      Methodological Clarifications:

      (1) The authors mention a contrastive learning comparison but provide few details. Since contrastive learning is primarily a technique to avoid collapse, it would be a more rigorous baseline if implemented within the same architecture as RPL to isolate the effect of the predictive objective.

      (2) In the PFC data comparison (Figure 7f), there appears to be a discrepancy: the local and global conditions show nearly identical results in PFC, but different dynamics in the model. It is unclear if this is a visualization error or a genuine model deviation.

      (3) The criteria for selecting specific model variables for comparison with V1 versus PFC are not explicitly defined. Clarification is needed on whether the same latent variables were used for both brain regions or if different layers were selected.

    5. Author response:

      Public Reviews:

      Reviewer #1 (Public review):

      The paper describes a biologically plausible version of JEPA using recurrent neural networks called RPL for recurrent predictive learning. Given an embedding z<sub>t</sub>, a recurrent neural network processes these inputs with the form: c<sub>t+1</sub> = RNN(c<sub>t</sub>, z<sub>t</sub>). Then the predictive network f is predicting the future inputs with the format: min ||f(c<sub>t</sub>) − stop_grad(z<sub>t+∆t</sub>)||<sup>2</sup>. I understand that a prediction error is defined as: e = z<sub>t+∆t</sub> − f(c<sub>t</sub>) to model cortical measurements in the oddball task.

      The RPL model is also shown to build an internal world model, with “real-world” data like the movement of moving animals or speech signals. The representation is then compared to V1 data and expected prediction error signals in an oddball setting. In a stacked hierarchy of RNN learning with RPL, the higher layers appear to learn high-level latent variables, although gradients are not propagated downward to the lower layers.

      The paper tackles an open question: Self-supervised learning is thought to be a fundamental principle to explain how computation is structured in the brain. Cortical data suggest qualitatively that prediction error is a core principle of representation learning in the brain, but the field is still looking for a simple yet expressive model that would explain how the cortex learns its representations. RPL contributes in that direction by making a useful link between cortical representation learning in RNN models and the JEPA learning algorithm that was demonstrated to scale to large world model learning from video data by Lecun’s group. It is very useful to connect this popular deep learning algorithm to cortical data.

      The model formalism is relatively elegant and simple: Simple next input prediction objectives are conceptually simple but not necessarily trivial to build at scale. There is a clear benefit in comparison with contrastive or IL methods because they are free from dataset-specific data augmentation and negative samples, thereby moving the comp neuro field towards conceptually simpler models of representation in the cortex. Yet predictive-only models (and in particular predictive models in latent space instead of pixel space) are not easy to build in a stable fashion. The JEPA family is basically intended to solve this question; it is very nice and timely to bring this to comp neuro.

      The methodology combining comp neuro and deep learning makes sense: The conceptual and qualitative analogy with cortical prediction errors is relevant and consistent with what is expected as a model of self-supervised learning in cortical models. The methodology to compare RPL with IL and CL is methodologically meaningful and grounded: showing, for instance, how some of the models fail to represent some latent structure in some toy datasets is interesting.

      (1.1) h-RPL: The h-RPL is perhaps the most creative departure from the JEPA model family. It would be interesting to say more about what was particularly difficult to see in the latent variables emerging in the hierarchical model. I often find it magical that layer-wise learning rules of this type are not learning redundant representations. Any insights why this is not the case here would be potentially insightful.

      We thank the reviewer for this comment. Regarding representational collapse in h-RPL: each local circuit independently applies the same collapse-preventing strategy as the single-level RPL model, namely the asymmetric prediction architecture combined with the stop-grad operator. Since this mechanism operates locally within each circuit, it is sufficient to prevent collapse at every level of the hierarchy independently (see also our response to Point P1.3).

      The more subtle question is why the circuits learn non-redundant rather than identical representations across the hierarchy. We believe two mechanisms are at play here. First, the hierarchical encoder is a stacked convolutional network, meaning that receptive field sizes grow with depth. This architectural inductive bias naturally encourages successive circuits to operate on increasingly spatially integrated features, creating a structural pressure toward learning complementary rather than redundant representations. Second, the growing expressivity of the network with depth means that higher circuits have access to richer, more abstract inputs from which they can extract higher-level latent structure that is not already captured by lower circuits. Together, these factors (the local collapse-preventing mechanism and the depth-dependent growth in receptive field size and network expressivity) presumably explain why h-RPL builds an increasingly refined and non-redundant representational hierarchy.

      What we will do: We will expand our discussion on this point in the revised manuscript. We plan to expand our quantification on how abstractions emerge in h-RPL in future work in which we will also study variations with top-down connections.

      (1.2) In general, I fully support the type of question and ideas that the paper is putting forward. It is, however, very hard in this research field to gain insight into specific conceptual contributions or specific bits of experimental data that the model puts forward. In pointing to the following weaknesses, I am encouraging the authors to lay out more clearly what the unique hypothesis is or the contribution of the RPL model that we should remember it for.

      We thank the reviewer for the positive feedback and the constructive criticism. We agree that articulating the core contributions more crisply would strengthen the paper.

      At its heart, we believe the paper makes two contributions we hope it will be remembered for. First, while prior work has established that invariant representations can be learned via local Hebbian-like learning rules, we show that learning equivariant representations alongside a latent dynamics model requires something qualitatively different: a local circuit with recurrent dynamics and an asymmetric predictive architecture. RPL provides a minimal concrete instantiation of this principle.

      Second, and perhaps more broadly, the model makes a structural prediction about (cortical) neuronal circuit organization: since the encoder, integrator, and predictor each perform functionally distinct computations, the framework implies the existence of corresponding cell types and connectivity patterns one should look for in experimental data.

      What we will do: We will sharpen the messages above in the revised manuscript to ensure these contributions are prominently highlighted throughout the paper.

      (1.3) Comparison with JEPA variants: JEPA variants are integrating different details into the learning algorithm. Integrating, for instance, “masking” of the latent encoder targets, or EMA in the style of BYOL or Siamese networks, for the predicted representations. It is great that RPL does not seem to need any of those (next input prediction is a natural implementation of masking, and EMA does not seem to be used). It is notoriously hard for the JEPA model to work without these features. Since some of these details are sometimes surprisingly crucial for a simulation to work, it would be good to report which of the other important details were key to live without EMA and masking. Is it the difference in learning rate, for instance? Or maybe the tasks considered are simply easy enough for any model to work; if so, it could be useful to acknowledge to what extent this is true.

      We thank the reviewer for raising this important point. There are two key mechanisms that ensure stable, non-trivial training in RPL. First, using a higher learning rate for the predictor relative to the encoder is crucial for stable training. This prevents the predictor from collapsing the encoder representations and was already noted empirically by Chen et al. (2021).

      Second, and more fundamentally, predicting at the level of the memoryless encoder output, rather than at the level of the recurrent integrator, is essential to prevent a degenerate solution in which the RNN simply learns to generate an internally predictable time series unrelated to the input. By anchoring the prediction target to the encoder, the model is forced to ground its representations in the sensory input. Intuitively, otherwise the RNN can simply “make up” a predictable time series, which satisfies the learning objective, but would not yield useful internal representations.
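
      The degenerate solution described here can be illustrated with a toy sketch. It assumes a hypothetical integrator that ignores its input and simply alternates between +1 and -1, together with a predictor that flips the sign (`ignoring_integrator`, `flip_predictor` and the example sequence are made up for illustration, not part of the model). If the prediction target were the integrator state itself, this input-independent time series would reach zero loss; against the encoder output it does not.

```python
def mse(preds, targets):
    """Mean squared error between two equally long lists."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)


def ignoring_integrator(z_seq):
    """Hypothetical degenerate integrator: alternates +1/-1 and ignores z."""
    return [(-1.0) ** t for t in range(len(z_seq))]


def flip_predictor(c):
    """Predicts the next value of the alternating series perfectly."""
    return -c


# Toy encoder outputs z_0..z_4 (arbitrary numbers standing in for sensory input).
z = [0.3, -1.2, 0.8, 2.0, -0.5]
c = ignoring_integrator(z)
preds = [flip_predictor(c_t) for c_t in c[:-1]]

# Target at the integrator level: the made-up series is perfectly predictable.
loss_integrator_target = mse(preds, c[1:])

# Target at the encoder level: the same series fails to predict the input.
loss_encoder_target = mse(preds, z[1:])

if __name__ == "__main__":
    print(loss_integrator_target, loss_encoder_target)
```

      The first loss is zero even though `c` carries no information about `z`, whereas the second stays positive, so only the encoder-level target penalizes the ungrounded solution in this toy case.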

      Beyond these architectural points, previous work from our group (Srinath Halvagal et al., 2023) has shown mathematically that JEPAs without EMA avoid collapse via an implicit variance regularization mechanism, and we believe RPL benefits from the same principle. Indeed, we now have a more complete theoretical understanding of this, including identifiability proofs for the latent dynamical model under relatively mild assumptions (Mikulasch et al., 2026). This work has recently been accepted at ICML. Other than that, one has to ensure that representations are not already nearly collapsed at the beginning of training. In this paper, we used normalization layers (batchnorm) in the encoder to ensure this.

      Finally, as in all SSL paradigms, the augmentation strength is an important hyperparameter that impacts the quality of learned representations. In the temporal predictive setting, the augmentation strength is fixed by the world itself. The only knob we have to play with is the prediction horizon ∆. While we typically focused on next-time-step (∆ = 1) prediction, we saw a clear effect in the case of the speech dataset, where ∆ = 8, but not ∆ = 1, yielded useful representations for the tasks (Fig. 5b).

      What we will do: We will discuss the above points more prominently in the discussion to avoid them being overlooked in the methods. Additionally, we will include a plot on the empirical prediction horizon for the speech dataset in the supplementary material for reference.

      (1.4) Comparison with IL and CL: On a high level, the comparison with IL and CL algorithms is written as conclusive. I suspect that the failure modes of IL and CL that are described are not due to the algorithms themselves, but rather to the construction of invariance statistics or the choice of negative sample sets (the sets of samples among which variance 1 is requested by VICreg). For instance, if variance (or negative sample set) is taken only across time, the variance object identity is expected to collapse. Similarly, if the variance is taken across the object identity, the variance across time can collapse. So I wonder if the failure of IL and CL is induced by the construction of the variance definition.

      We thank the reviewer for this thoughtful point. Both RPL and CL implement an implicit variance regularizer by virtue of being JEPAs (Srinath Halvagal et al., 2023), whereas IL uses an explicit regularizer computed along both the batch and time dimensions to avoid representational and dimensional collapse. The failure modes of IL and CL therefore cannot be entirely attributed to the statistics of the input samples chosen for variance regularization, but are instead primarily determined by the choice of prediction and target representations.
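
      To make the role of the batch and time dimensions concrete, the following toy sketch evaluates a VICReg-style hinge on the standard deviation of a single feature along each dimension separately. It is a simplified illustration, assuming a target standard deviation of 1 and omitting the small epsilon used in practice; `variance_penalty` and the numbers in `feats` are hypothetical and not the regularizer used in the manuscript.

```python
from statistics import pstdev


def variance_penalty(values, target_std=1.0):
    """Hinge penalty that is positive when the std falls below target_std."""
    return max(0.0, target_std - pstdev(values))


# feats[b][t]: one scalar feature for object b at time t (toy numbers).
# The feature varies over time but is identical for both objects.
feats = [
    [-1.0, 1.0, -1.0, 1.0],
    [-1.0, 1.0, -1.0, 1.0],
]

# Variance taken across time, separately for each object.
penalty_over_time = [variance_penalty(row) for row in feats]

# Variance taken across the batch (objects), separately for each time step.
penalty_over_batch = [variance_penalty(col) for col in zip(*feats)]

if __name__ == "__main__":
    print(penalty_over_time, penalty_over_batch)
```

      In this example the penalty along time vanishes while the penalty along the batch is maximal, which shows how a regularizer computed along only one dimension could leave collapse along the other undetected.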

      What we will do: We will clarify this in the Methods section of the revised manuscript.

      (1.5) Prediction error: When compared to the recording of cortical activity in Figure 7, it is not obvious from the figure which latent space we are talking about mathematically. Is it the vector z, c or the prediction error e? This is rather important from a neuroscientific point of view, because the prediction error e is expected to explain the neuronal data. On the other hand, the prediction error e is only used in the learning algorithm to define the loss function, but it is not the communication medium between the RNN units c (or with the encoder z).

      In the brain, since the measurements are recorded as neural activity, they are communication channels between specific units (z or c). It is probably c or z that would already explain the oddball prediction error. I believe that other models, like Forward-forward of Nejad et al., have tried quite hard to address this apparent tension. Whether or not this is resolved by RPL, I think it would be beneficial to state the problem and clarify how the algorithm addresses or ignores the issue.

      Thanks for pointing out the issue with regards to clarity and for raising the important but subtle point about prediction error representation. To answer the immediate question asking which vector we use in Figure 7, it is the vector c corresponding to the integrator representations. We agree this should be stated explicitly and will update the manuscript accordingly.

      On the more general point, we agree that the tension between recordable neural activity and the computational role of prediction errors is an important issue. We do already briefly engage with it in the Discussion (subsection “Relation to previous modeling work”), where we note that under RPL “inter-areal communication is dominated by representations rather than error signals”. However, we agree that this point should be surfaced more directly.

      To elaborate, under classical predictive coding, prediction errors are the inter-areal communication channel and are therefore expected to be directly observable in neural recordings, e.g., as oddball responses. Under RPL, this is not the case: e is computed locally within a circuit and serves only as a learning signal for synaptic plasticity, not as a signal propagated between circuits or areas. What cortex primarily encodes and communicates in our framework are predictive representations, not reconstruction errors. Accordingly, what should map onto recorded population activity are the representations c (and z), while locally computed prediction errors could in principle remain observable as more circumscribed or transient mismatch-like signals within a circuit.

      We would like to push this point further. The reviewer frames this as a tension that RPL needs to resolve, but growing neurophysiological evidence suggests that classical residual-difference prediction errors may not be a dominant mode of cortical encoding in the first place. Furutachi, Franklin, et al. (2024) showed that V1 responses to unexpected visual stimuli do not encode how input deviates from predictions, but instead selectively amplify the representation of the unexpected stimulus itself. Very recently, Furutachi and Hofer (2026) generalize this into a revised framework in which feedforward pathways transmit sensory representations modulated by prediction-error magnitude, rather than residual differences. Vasilevskaya et al. (2026) constrain the space of plausible cortical algorithms via functional-influence experiments, also concluding that no variant of standard predictive processing is consistent with the full pattern of layer 2/3 ↔ layer 5 interactions; they propose a JEPA-based model, citing RPL as a promising candidate. The model by Nejad et al. (2025) similarly shares with RPL the property that representations, rather than residual errors, propagate between circuit elements.

      Taken together, the apparent tension may be less a problem RPL needs to resolve than one it is well positioned to explain, remaining consistent with the emerging picture of cortex as encoding amplified sensory features rather than transmitting residual errors across areas.

      What we will do: We will add missing information to the main text and sharpen the Discussion with these arguments.

      (1.6) Successor representation without value? I believe the term successor representation is historically relevant in a reinforcement learning (RL) setting and has a precise mathematical definition. Without RL, I feel that learning successor representation is conceptually identical to learning a transition matrix (aka, a primitive world model). I therefore wonder if the pitch for high-level framing of the successor representation is appropriately described or trivial.

      The reviewer makes a valid point on the concept of successor representations. To answer the immediate question, it is not entirely trivial, as we observe not only the emergence of the transition structure (Fig. 6c), but also the encoding of decaying future (but not past) state occupancy (Fig. 6d,e). We largely adopted the terminology “successor-like representations” from the study by Ekman et al. (2023), and we elaborate here on why we retained it. As the reviewer points out, the term “successor representations” was introduced in the RL literature (Dayan, 1993) and was later adopted in neuroscience to describe the idea that a neuronal population encodes a predictive representation that reflects the expected future occupancy of states under a given policy. Ekman et al. (2023) use the term “successor-like representations” to describe the phenomenon in which neural activity in V1 (and hippocampus) represents both current and (discounted) future, but not past, state occupancies in a sequence learning task with no explicitly defined policy or value training. In other words, successor-like representations are simply predictive representations.
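
      The difference between a transition matrix and discounted future occupancy can be shown with a small worked example. The sketch below computes the textbook successor matrix M = Σ_k γ^k T^k, truncated at a finite horizon, for a hypothetical four-state sequence A → B → C → D that ends at D. The function name `successor_matrix`, the discount γ = 0.5 and the chain itself are illustrative assumptions, not quantities from the manuscript.

```python
def successor_matrix(T, gamma, horizon):
    """Truncated successor matrix: sum over k = 0..horizon of gamma**k * T**k."""
    n = len(T)
    M = [[0.0] * n for _ in range(n)]
    # P holds T**k, starting from the identity (k = 0).
    P = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for k in range(horizon + 1):
        for i in range(n):
            for j in range(n):
                M[i][j] += (gamma ** k) * P[i][j]
        # Advance P from T**k to T**(k + 1).
        P = [
            [sum(P[i][m] * T[m][j] for m in range(n)) for j in range(n)]
            for i in range(n)
        ]
    return M


# Deterministic chain A -> B -> C -> D; D has no successor (the sequence ends).
T = [
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
M = successor_matrix(T, gamma=0.5, horizon=3)

if __name__ == "__main__":
    for state, row in zip("ABCD", M):
        print(state, row)
```

      The row for state B contains weight 1 on B itself, 0.5 on C and 0.25 on D, but zero on the past state A. The transition matrix `T` alone only marks the immediate successor C, so the decaying profile over future states is the additional signature.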

      What we will do: To deal with this dichotomy, we will replace “successor-like representations” with the term “predictive representations” in the abstract and clarify this distinction in the Results section of the revised manuscript.

      (1.7) Learning in RNN: Learning with recurrent networks appears to be a key in this model presented here (it is in the algorithm name). Yet, this aspect of the model and the literature on biologically plausible learning rules for RNN is not really discussed.

      We thank the reviewer for raising this concern. While h-RPL is one step toward more biologically plausible and spatially local learning rules, exploring it further in terms of temporal credit assignment is beyond the scope of the present study and would require a more systematic and in-depth analysis. However, moving toward more biologically plausible learning rules is an interesting research direction that we plan to explore, as we also mentioned in the Discussion (“Limitations and future research directions”).

      We think a viable strategy could be to combine a slim spatial credit assignment strategy such as feedback alignment (Nøkland, 2016; Lillicrap et al., 2016) with an online learning rule using eligibility traces for temporal credit assignment such as SuperSpike (Zenke et al., 2018) or e-prop (Bellec et al., 2020). Similar strategies have given promising results for CLAPP (Illing et al., 2021; Zihan et al., 2026).

      What we will do: Following the suggestion, we will discuss biologically plausible learning rules for RNNs in the Discussion.

      Reviewer #2 (Public review):

      This is a very interesting manuscript, which proposes a novel idea on how cortical networks may learn useful representations of sensory stimuli. The model implementing this idea is thoroughly tested in multiple experimental paradigms. The manuscript is very clearly written. I feel it may have a significant impact on our understanding of cortical circuitry.

      Reviewer #3 (Public review):

      This paper presents Recurrent Predictive Learning (RPL), a self-supervised model conceptually similar to Joint-Embedding Predictive Architecture (JEPA) models. RPL sequentially observes dynamic scenes to predict subsequent observations. A central claim of the work is that the model’s trained representations are simultaneously invariant and equivariant to transformations such as movement, properties that emerge without explicit supervision. These representational qualities are demonstrated through three experiments utilizing two simulated datasets and one naturalistic dataset. Furthermore, the latent embeddings are qualitatively compared with neural data, showing that the model reproduces the successor representation observed in human V1 and the local/global oddball effect in the monkey Prefrontal Cortex.

      The paper addresses a fundamental question relevant to both computational neuroscience and machine vision: how the brain learns representations that are simultaneously invariant and equivariant to transformations. The manuscript is well-written, easy to follow, and supported by clear visualizations.

      While JEPA-style models have recently gained significant traction in the artificial intelligence community, this paper nicely bridges the gap to neuroscience. By framing these architectures as a theory for visual learning in the brain, the authors provide valuable insights into how predictive frameworks can explain cortical processing.

      The qualitative alignment with V1 and PFC data is a particularly strong contribution, as it offers a potential mechanistic explanation for observed neural phenomena through the lens of self-supervised learning.

      (3.1) The central claim, that both invariance and equivariance emerge spontaneously, requires further scrutiny (see Ghaemi et al., NeurIPS, 2025; Garrido et al., arXiv, 2024). In particular, the synthetic “moving animal” dataset used in this paper may be too simple to fully support this claim. In latent space prediction, a model must predict both the scene content and the dynamics of movement. Because movement (whether ego-motion or external) is often highly uncertain (or multi-modal), predictive models in naturalistic settings often “collapse” toward learning purely invariant representations, ignoring the hard-to-predict dynamics. In the provided simulations, the movements are extremely predictable. In more complex scenarios, the model would likely prioritize content (invariance) over dynamics (equivariance) unless aided by action-conditioning or explicit factor estimation (Zhang et al., ICLR, 2026). The authors’ results in Figure 5 using naturalistic video seem to reflect this limitation, given the lower performance on the naturalistic videos compared to the synthetic datasets.

      We thank the reviewer for the feedback. We agree that further validation on more complex datasets would strengthen the claims, and we take this point seriously. If the reviewer has any suggestions for a specific alternative dataset, we would welcome any recommendations.

      Regarding the mouse video data specifically, we realized that this is a suboptimal benchmark rather than a shortcoming of our method. The culprit presumably is that the mice remain largely stationary, leading to a heavily imbalanced velocity distribution peaked near zero (Supplementary Fig. S9). This imbalance makes equivariance evaluation unreliable regardless of the learning algorithm. For example, end-to-end supervised training results in an R² of 0.19 compared to 0.08 ± 0.02 for RPL.
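
      One possible way to see why such an imbalance can hurt the evaluation is sketched below. It is a toy illustration under stated assumptions and not an analysis of our data: the velocity distributions, the noise level, and all names (`r2_score`, `v_balanced`, `v_imbalanced`) are hypothetical. With the same decoding error, a target that is mostly near zero has little variance to explain, so the coefficient of determination drops.

```python
import numpy as np

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Hypothetical synthetic velocities; not taken from the mouse dataset.
rng = np.random.default_rng(0)
n = 5000
noise = rng.normal(0.0, 0.5, size=n)  # identical decoder error in both cases

# Balanced case: velocities spread evenly over a range.
v_balanced = rng.uniform(-2.0, 2.0, size=n)

# Imbalanced case: 90% of samples are nearly stationary.
stationary = rng.random(n) < 0.9
v_imbalanced = np.where(stationary,
                        rng.normal(0.0, 0.05, size=n),
                        rng.uniform(-2.0, 2.0, size=n))

r2_balanced = r2_score(v_balanced, v_balanced + noise)
r2_imbalanced = r2_score(v_imbalanced, v_imbalanced + noise)
print(f"balanced: {r2_balanced:.2f}, imbalanced: {r2_imbalanced:.2f}")
```

      In this toy setting the decoder is equally accurate in absolute terms in both cases, yet the imbalanced target yields a much lower score, which is why a low R² on such data is hard to interpret on its own.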

      Regarding the moving animal dataset, we note that the dynamics are not trivial from an SSL perspective: unlike moving MNIST (Srivastava et al., 2015), the dataset includes changes in scale and orientation, both features that invariance-focused SSL models can easily ignore, yet RPL recovers reliably. For example, this discrepancy can be seen in Supplementary Table S1 where we compare to InfoNCE and CPC. That said, we acknowledge the reviewer’s broader concern and will seek to validate RPL on more complex datasets.

      While it would be nice to compare to related work by Ghaemi et al. (2024), this study used 3DIEBench (Garrido et al., 2023). Unfortunately, 3DIEBench’s reliance on pair-based representations with annotated but random augmentations (such as rotations or color changes) precludes the possibility of smooth latent traversals that would be required for RPL to learn from the same dataset. We will look into whether it is computationally feasible to adapt or regenerate a similar dataset that meets the requirements for temporal prediction.

      Regarding stochasticity, we agree that predictive learning in latent space is most natural in approximately deterministic settings, whereas real-world sensory information often comprises non-deterministic elements. While a deeper treatment of such stochastic environments is beyond the scope of the present manuscript, it will be the focus of ongoing and future work. Regarding ongoing work, it is worth mentioning that in a recent study from our group (Hauri et al., 2026), we have demonstrated that RPL’s core objective can replace the reconstruction loss in Dreamer, achieving competitive performance in complex, stochastic environments. While we did not systematically evaluate equivariance in that study, the results suggest that representation-space predictive learning is viable beyond the deterministic regime.

      What we will do: We will make the point about the real-world mouse video dataset being a poor benchmark and include the additional R² values to show that. Further, we will try to identify or generate alternative datasets to back the equivariance claims and discuss our findings in the light of previous work, e.g., Ghaemi et al. (2024). Moreover, we will sharpen our discussion of our model’s limitations in stochastic settings and highlight notable connections to related work.

      (3.2) The framing of the RPL model as an entirely new theory of representation learning is slightly overstated. The focus on prediction in representation space rather than input space is the defining characteristic of JEPA and various other Self-Supervised Learning (SSL) models, even sequential prediction. While this paper clarifies the connection between these AI frameworks and cortical circuits, the work would be strengthened by more explicitly positioning RPL within the context of existing JEPA-style models and prior SSL theories of the visual system.

      Thanks for raising this point. We are unsure what the reviewer refers to. We did not frame our work as “an entirely new theory of representation learning,” as the reviewer suggests. In fact, we highlight quite the opposite already in the title of our article, which reads: “Understanding neural circuit principles for representation learning through joint-embedding predictive architectures.” We do not claim novelty over JEPA as an ML paradigm; we adopt it precisely because it provides a principled, non-generative framework for predictive representation learning, and our goal is to develop a circuit-level instantiation that accounts for neural circuit computation. We already discuss a body of previous work on self-supervised learning and JEPAs at length. Since the reviewer did not specify what they are missing, we will briefly reiterate what is already there.

      Our contribution is a theory of representation learning in the brain, built on JEPAs as the underlying ML framework. The Title and Introduction already position our work quite explicitly this way. Specifically, we mention prior work on JEPAs (CPC, BYOL, SimSiam, I-JEPA, seq-JEPA, V-JEPA, V-JEPA 2), while noting that “most JEPAs developed in machine learning are poor models of cortical computation” because of their reliance on negative sampling, transformers, masking, static images, and/or known parametrized transformations, and motivate RPL as the minimal candidate that “must instead rely on recurrent neural dynamics, learn from streaming sensory input without masking, support both invariant and equivariant representations, and reproduce key neurophysiological observations.”

      The Discussion (“Relation to previous modeling work”) further details the specific novelties of RPL relative to existing sequential JEPA-style and SSL models like CPC (Oord et al., 2018), V-JEPA (Bardes et al., 2024), V-JEPA 2 (Assran et al., 2025), and seq-JEPA (Ghaemi et al., 2024). In brief:

      RPL is a recurrent JEPA based on RNN dynamics, not transformers, and learns from streaming sensory input without masking or random negative sampling;

      It explicitly compares three prediction-error topologies (RPL vs. invariance learning vs. context prediction; Fig. 2, Suppl. Fig. S2, S6) and shows that asymmetric recurrent prediction is essential for jointly learning invariant and equivariant representations;

      Importantly, it does so via pure temporal prediction without access to underlying transformations, a property shared by very few JEPAs. The closest exception is VJ-VCR (Drozdov et al., 2024), which uses an explicit variance-covariance regularization (VCReg) in a JEPA and which we will cite in the revised manuscript;

      It provides the first hierarchical JEPA optimizing local prediction errors at multiple levels (h-RPL, Fig. 8), as envisioned by LeCun (2022) but not previously implemented;

      It connects directly to neurophysiological data: successor-like representations in human V1 and abstract sequence representations in macaque PFC, which provides qualitative correspondence between JEPA components and cortical activity that the existing JEPA literature, focused on ML benchmarks, does not address.

      Finally, our article already includes a discussion paragraph on recent self-supervised learning models in the context of the brain where we discuss work by Nejad et al. (2025) and Asabuki et al. (2025). Most other SSL theories of the visual system rely on static images and recognition tasks (Yerxa et al., 2024; Margalit et al., 2024). However, there are two studies that include temporal prediction objectives and are worth mentioning in more detail: First, Bakhtiari et al. (2021) show that representations similar to ventral and dorsal pathways in the visual system can emerge in a two-pathway encoder architecture within the CPC model. Second, Niu et al. (2024) use a “straightening” objective together with VCReg as a practical model of the perceptual straightening hypothesis (Hénaff et al., 2019). Though not a JEPA (i.e., it has no predictor network), their model can decode equivariant factors in a sequential MNIST dataset where only single factors change throughout a video.

      What we will do: We will carefully review our discussion of previous work and further discuss Drozdov et al. (2024), Bakhtiari et al. (2021), and Niu et al. (2024) in the revised manuscript.

      (3.3) A significant challenge in latent-space SSL is avoiding “representational collapse” (where the model provides a trivial constant output). While the paper alludes to JEPA-like solutions, it lacks a detailed explanation (in both the text and the architectural schematics) of the specific technique used to prevent collapse. Consequently, it is difficult to evaluate the authors’ claim of “biological plausibility,” as the biological equivalents of common machine learning techniques (such as stop gradient) are not discussed.

      Thanks for pointing this out. Our model avoids collapse through the asymmetric stop-grad / predictor architecture. It does not require an EMA when the predictor learns with a faster learning rate than the rest of the network (see also our response to Point P1.3).
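
      The asymmetry can be illustrated with a minimal sketch. This is not the RPL implementation: it assumes a linear encoder `W` and a linear predictor `P` instead of recurrent dynamics, and the function name, sizes, and learning rates are hypothetical. It only shows where the stop-grad acts (no gradient flows through the target embedding) and that the predictor is updated with a larger step than the encoder.

```python
import numpy as np

def stopgrad_prediction_grads(W, P, x_t, x_next):
    """Loss and gradients for predicting the next embedding with a stop-grad target."""
    z_t = W @ x_t              # online branch: embedding of the current input
    target = W @ x_next        # target branch: treated as a constant (stop-grad)
    err = P @ z_t - target     # prediction error in representation space
    loss = float(err @ err)
    grad_P = 2.0 * np.outer(err, z_t)        # predictor gradient
    grad_W = 2.0 * np.outer(P.T @ err, x_t)  # encoder gradient, online branch only
    return loss, grad_P, grad_W

# Hypothetical sizes and learning rates, chosen only for illustration.
rng = np.random.default_rng(0)
W = 0.1 * rng.normal(size=(4, 8))   # encoder weights (assumed linear)
P = np.eye(4)                       # predictor weights (assumed linear)
x_t, x_next = rng.normal(size=8), rng.normal(size=8)

lr_encoder, lr_predictor = 1e-3, 1e-2   # predictor learns faster than the encoder

loss, grad_P, grad_W = stopgrad_prediction_grads(W, P, x_t, x_next)
P_new = P - lr_predictor * grad_P
W_new = W - lr_encoder * grad_W
```

      Note that `grad_W` contains no term from `target`, even though the target is computed with the same weights; that omission is what the stop-grad amounts to in this simplified setting.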

      The use of stop-grad suggests that a circuit learning with RPL needs to compute a vector-based instructive learning signal. While we do not explicitly model the circuit-level mechanisms of how this could be implemented in the brain, excitation-inhibition balance is one possibility (Rossbroich et al., 2025). Finally, differences in learning rate can be implemented either structurally or functionally in the brain (see Liu et al. (2025), for instance), and activity normalization has been suggested as a canonical computation in biological neural circuits (Carandini et al., 2012).

      What we will do: We will make sure to discuss these putative biological mechanisms in the revised manuscript.

      (3.4) Recent work has shown that the capacity (size) of the predictor significantly influences the learned representations in a JEPA-type world model (Garrido et al., 2024). In simpler scenarios, a large enough predictor can allow a model to “memorize” dynamics rather than learning generalized equivariant features. It would be beneficial to see how the ratio of predictor size to encoder size affects the emergence of these features.

      Thanks for raising this concern. We don’t observe a noticeable difference in position and velocity decoding when changing the width or depth of the MLP predictor in the moving animals data. However, performance on rotation speed and orientation decoding scales with changes in the width, but not the depth, of the predictor. This analysis excludes the effect of the integrator’s capacity, as it directly affects the dimensionality of the representations, even though it also effectively contributes to the prediction computation in RPL.

      What we will do: We will include a figure showing how task performance varies with the predictor’s width and depth.

      Methodological Clarifications

      (3.5) The authors mention a contrastive learning comparison but provide few details. Since contrastive learning is primarily a technique to avoid collapse, it would be a more rigorous baseline if implemented within the same architecture as RPL to isolate the effect of the predictive objective.

      Thanks for the question. We already use the same network model as in RPL for the contrastive predictive learning (InfoNCE) baseline in Supplementary Table S1, as mentioned in the main text (l.164).

      What we will do: We will mention the architecture of the non-linear predictor used for the InfoNCE baseline in Methods more explicitly.

      (3.6) In the PFC data comparison (Figure 7f), there appears to be a discrepancy where the local and global conditions show nearly identical results in PFC, but different dynamics in the model. It is unclear if this is a visualization error or a genuine model deviation.

      Thanks for picking up on this subtlety in the experimental results. To clarify, it is a model deviation but an interesting one. The local and global responses do look quite similar in the original PFC data. They differ in that the global oddball (xY|xx and xx|xY) response has a secondary peak that encodes the presence of the global oddball, whereas the initial response is actually dominated by local oddball encoding (xY vs xx). Concretely, this results in the response to the xx|xY condition only showing up weakly in the data and at a time lag with respect to the initial local oddball response. Our model, however, does not show the transient initial response to local oddballs in the decoding direction for global oddballs. In a sense, the network model encodes the global oddball concept more robustly than is seen in the PFC data. That said, whether this indicates a genuine difference in representational strategies that needs to be further accounted for, or whether it is an issue stemming from limited sub-sampling of PFC neurons, remains unclear.

      (3.7) The criteria for selecting specific model variables for comparison with V1 versus PFC are not explicitly defined. Clarification is needed on whether the same latent variables were used for both brain regions or if different layers were selected.

      To clarify, the successor-like representations in human V1 and abstract representations in macaque PFC are two different experiments, so each has different latent variables requiring different RPL models. The architecture used for each experiment is detailed in Methods, and the criterion for selecting each architecture was to use the simplest one that should work given the task complexity. Throughout the paper, all representation analysis is done on the output of the integrator (c) unless stated otherwise. We hope this resolves the confusion.

      References

      Chen, Xinlei et al. (2021). “Exploring simple siamese representation learning”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15750–15758.

      Srinath Halvagal, Manu et al. (2023). “Implicit variance regularization in non-contrastive SSL”. In: Advances in Neural Information Processing Systems 36, pp. 63409–63436.

      Mikulasch, Fabian A et al. (2026). Understanding Self-Supervised Learning via Latent Distribution Matching. arXiv: 2605.03517[cs.LG].

      Furutachi, Shohei, Alexis D. Franklin, et al. (Sept. 2024). “Cooperative thalamocortical circuit mechanism for sensory prediction errors”. In: Nature 633.8029, pp. 398–406. issn: 1476-4687. doi: 10.1038/s41586-024-07851-w.

      Furutachi, Shohei and Sonja B Hofer (2026). “Rethinking Predictive Processing”. In: Annual Review of Neuroscience 49.

      Vasilevskaya, Anna et al. (2026). “A functional influence based circuit motif that constrains the set of plausible algorithms of cortical function”. In: bioRxiv. doi: 10.64898/2026.01.29.702557. eprint: https://www.biorxiv.org/content/early/2026/01/29/2026.01.29.702557.full.pdf.

      Nejad, Kevin Kermani et al. (July 2025). “Self-supervised predictive learning accounts for cortical layer-specificity”. In: Nat Commun 16.1, p. 6178. issn: 2041-1723. doi: 10.1038/s41467-025-61399-5.

      Ekman, Matthias et al. (Feb. 2023). “Successor-like representation guides the prediction of future events in human visual cortex and hippocampus”. In: eLife 12. Ed. by Morgan Barense et al., e78904. issn: 2050-084X. doi: 10.7554/eLife.78904.

      Dayan, Peter (1993). “Improving generalization for temporal difference learning: The successor representation”. In: Neural computation 5.4, pp. 613–624.

      Nøkland, Arild (2016). “Direct feedback alignment provides learning in deep neural networks”. In: Advances in neural information processing systems 29.

      Lillicrap, Timothy P et al. (2016). “Random synaptic feedback weights support error backpropagation for deep learning”. In: Nature communications 7.1, p. 13276.

      Zenke, Friedemann et al. (2018). “Superspike: Supervised learning in multilayer spiking neural networks”. In: Neural computation 30.6, pp. 1514–1541.

      Bellec, Guillaume et al. (2020). “A solution to the learning dilemma for recurrent networks of spiking neurons”. In: Nature communications 11.1, p. 3625.

      Illing, Bernd et al. (2021). “Local plasticity rules can learn deep representations using self-supervised contrastive predictions”. In: Advances in Neural Information Processing Systems 34.

      Zihan, Wu S et al. (2026). “Can Local Learning Match Self-Supervised Backpropagation?” In: arXiv preprint arXiv:2601.21683.

      Srivastava, Nitish et al. (2015). “Unsupervised learning of video representations using lstms”. In: International conference on machine learning. PMLR, pp. 843–852.

      Ghaemi, Hafez et al. (2024). “Seq-JEPA: Autoregressive Predictive Learning of Invariant-Equivariant World Models”. In: NeurIPS 2024 Workshop: Self-Supervised Learning - Theory and Practice.

      Garrido, Quentin et al. (2023). “Self-supervised learning of split invariant equivariant representations”. In: arXiv preprint arXiv:2302.10283.

      Hauri, Michael et al. (2026). “Dreamer-CDP: Improving Reconstruction-free World Models Via Continuous Deterministic Representation Prediction”. In: arXiv preprint arXiv:2603.07083.

      Oord, Aaron van den et al. (July 2018). “Representation Learning with Contrastive Predictive Coding”. In: arXiv:1807.03748 [cs, stat]. arXiv: 1807.03748.

      Bardes, Adrien et al. (2024). V-JEPA: Latent Video Prediction for Visual Representation Learning.

      Assran, Mido et al. (2025). “V-jepa 2: Self-supervised video models enable understanding, prediction and planning”. In: arXiv preprint arXiv:2506.09985.

      Drozdov, Katrina et al. (2024). “Video representation learning with joint-embedding predictive architectures”. In: arXiv preprint arXiv:2412.10925.

      LeCun, Yann (2022). “A Path Towards Autonomous Machine Intelligence, Version 0.9.2, 2022-06-27”.

      Asabuki, Toshitake et al. (2025). “Learning predictive signals within a local recurrent circuit”. In: Proceedings of the National Academy of Sciences 122.27, e2414674122. doi: 10.1073/pnas.2414674122. eprint: https://www.pnas.org/doi/pdf/10.1073/pnas.2414674122.

      Yerxa, Thomas et al. (2024). “Contrastive-equivariant self-supervised learning improves alignment with primate visual area it”. In: Advances in neural information processing systems 37, pp. 96045–96070.

      Margalit, Eshed et al. (2024). “A unifying framework for functional organization in early and higher ventral visual cortex”. In: Neuron 112.14, pp. 2435–2451.

      Bakhtiari, Shahab et al. (2021). “The functional specialization of visual cortex emerges from training parallel pathways with self-supervised predictive learning”. In: Advances in Neural Information Processing Systems. Ed. by M. Ranzato et al. Vol. 34. Curran Associates, Inc., pp. 25164–25178.

      Niu, Julie Xueyan et al. (2024). “Learning predictable and robust neural representations by straightening image sequences”. In: Advances in Neural Information Processing Systems 37, pp. 40316–40335.

      Hénaff, Olivier J et al. (2019). “Perceptual straightening of natural videos”. In: Nature neuroscience 22.6, pp. 984–991.

      Rossbroich, Julian et al. (2025). “Breaking Balance: Encoding local error signals in perturbations of excitation-inhibition balance”. In: bioRxiv, pp. 2025–05.

      Liu, Peng et al. (2025). “Layer-specific changes in sensory cortex across the lifespan in mice and humans”. In: Nature neuroscience 28.9, pp. 1978–1989.

      Carandini, Matteo et al. (2012). “Normalization as a canonical neural computation”. In: Nature reviews neuroscience 13.1, pp. 51–62.

    1. eLife Assessment

      This valuable study examines how the prelimbic cortex represents learned and generalized threat over time and identifies potentially distinct stable and dynamic subnetworks that may support these functions. The work is conceptually interesting and is strengthened by the longitudinal calcium imaging approach and the inclusion of key control groups. However, the evidence supporting the claims is incomplete, particularly because the interpretations regarding inference, time-dependent representational change, and the dissociation of neural activity from freezing behavior extend beyond what is currently established by the data.

    2. Reviewer #1 (Public review):

      Summary:

      The authors combine discriminative auditory fear conditioning with longitudinal in vivo calcium imaging to ask how prelimbic (PL) representations of learned and generalized threat evolve across recent and remote memory time points. Using two different CS+ frequencies and a no-shock control group, they report that PL population activity tracks graded behavioral generalization, that population similarity is highest for tones eliciting strong threat responding, and that distinct subnetworks can be identified that appear to encode tone-specific sensory features versus learned threat-related response structure.

      To my knowledge, this may be the first study to comprehensively examine neural encoding of fear generalization in prelimbic cortex (PL). The manuscript is ambitious and technically interesting, and several aspects are potentially important. In particular, the suggestion that neurons showing graded, learning-related response patterns become selectively stabilized over time is intriguing. The inclusion of two CS+ training conditions and a no-shock control also strengthens the case that at least some of the reported effects are related to associative learning rather than simple sensory differences. However, in its current form, the manuscript does not yet fully support the strength of the conceptual claims. Several issues limit confidence in the interpretation, including the possibility that repeated testing itself contributes to changes across days, uncertainty about the relationship between neural activity and freezing behavior, limited quantitative documentation of longitudinal cell registration, and a number of problems in figure clarity and statistical framing. Overall, the study contains promising observations, but the claims should be narrowed, and several analyses or controls would be needed to fully support the proposed framework.

      Detailed Comments

      (1) A general concern is that the repeated test procedure itself may contribute to extinction. Because the animals are exposed to multiple CS frequencies across multiple test days, and each tone is presented three times per session, some of the reported changes in behavior and neural activity across days could reflect extinction or repeated nonreinforced retrieval rather than the passage of time per se. This is especially relevant given that the manuscript makes claims about recent versus remote representations and representational drift over 30 days. At a minimum, the authors should discuss this limitation explicitly and temper claims about time-dependent changes. Ideally, they would include a control group in which animals are tested only once or twice (e.g., at an early and later time point with fewer CS frequencies), or a reduced-frequency testing design that minimizes extinction while still allowing evaluation of recent versus remote memory.

      (2) More generally, some of the reported learning-related neural differences may be driven by behavioral differences, particularly freezing, rather than by learning or generalization per se. For example, animals that freeze more to certain frequencies may show corresponding neural response differences simply because freezing alters PL activity. The authors should examine this possibility more directly. Analyses testing whether recorded cells encode freezing behavior, or whether tone frequency-related neural differences remain robust when comparing high- and low-freezing epochs, would help determine whether the reported effects reflect learned stimulus value rather than behavioral state differences.

      (3) A central feature of the manuscript is the analysis of neural response properties over an extended period of time, up to 30 days after learning. However, aside from a brief mention in the Methods that spatial registration was used, the manuscript provides very little quantitative information about this critical aspect of the study. The paper would be strengthened by including explicit metrics describing longitudinal cell tracking, such as the number and proportion of ROIs retained across all sessions, distributions of spatial-footprint correlations or centroid distances across days, and representative examples of matched imaging fields over time. Without this information, it is difficult to assess how strongly the longitudinal claims are supported.
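
      To make the requested metrics concrete, a minimal sketch of two of them is given below, assuming each ROI is available as a 2D spatial-footprint array per session. The function names and the toy footprints are hypothetical and are not taken from the manuscript or from any particular registration package; a real analysis would apply these to the matched ROI pairs produced by the authors' registration pipeline.

```python
import numpy as np

def centroid(footprint):
    """Intensity-weighted center of mass (row, col) of a 2D footprint."""
    fp = np.asarray(footprint, dtype=float)
    rows, cols = np.indices(fp.shape)
    total = fp.sum()
    return np.array([(rows * fp).sum() / total, (cols * fp).sum() / total])

def centroid_distance(fp_a, fp_b):
    """Euclidean distance in pixels between the centroids of two footprints."""
    return float(np.linalg.norm(centroid(fp_a) - centroid(fp_b)))

def footprint_correlation(fp_a, fp_b):
    """Pearson correlation between two footprints, flattened pixel-wise."""
    a = np.asarray(fp_a, dtype=float).ravel()
    b = np.asarray(fp_b, dtype=float).ravel()
    return float(np.corrcoef(a, b)[0, 1])

# Toy example: the same square ROI on two days, shifted by one pixel.
day1 = np.zeros((20, 20))
day1[8:12, 8:12] = 1.0
day2 = np.roll(day1, shift=1, axis=1)

dist = centroid_distance(day1, day2)
corr = footprint_correlation(day1, day2)
print(f"centroid distance: {dist:.2f} px, footprint correlation: {corr:.2f}")
```

      Reporting the distributions of these two quantities for matched pairs, alongside the same distributions for nearest non-matched neighbors, would let readers judge how well separated true matches are from false ones.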

      (4) The text states that "Figs. 1c and 1d show GCaMP6f expression in PL, representative calcium footprints, and activity traces". However, the figure as presented does not clearly show all of these elements, at least not in a way that matches the description in the Results. The correspondence between text and figure should be corrected.

      (5) The labeling of Figure 2a is insufficient for interpretation. The legend states that the panel shows raster plots of sound responsiveness, but the axes and scaling are not clearly defined. It is not clear from the figure what the x-axis represents, whether the y-axis corresponds to individual neurons, where the CS period occurs, or what the activity scale at the right denotes. Also, the term 'rasters' implies that spikes were analyzed. It seems that the spike inference approach (CASCADE) was only used for later analyses. Perhaps 'heat-plot' would be more accurate here? Generally, this figure should be annotated more clearly so that the reader can understand it without referring back to the Methods.

      (6) In relation to Figure 3, the analysis of population-averaged responses across tone frequencies is useful, but the manuscript would be stronger with additional statistical analyses across time and across groups. For example, if the authors want to argue that learning induces graded changes in neural responses and that these evolve across time, they should directly compare within-group responses across days and also compare matched frequencies between the conditioned groups and the no-shock controls. These analyses would help establish whether the observed differences are genuinely learning dependent and whether they change significantly over time.

      (7) The inclusion of two different CS+ frequencies and a no-shock control is a strength of the study and substantially improves the interpretation that graded neural responses are related to learning and generalization rather than to simple sensory processing or passage of time. That said, I am not entirely comfortable with the use of the term "inference" throughout the manuscript. What is being measured here appears closer to sensory generalization than inference in a stronger cognitive sense. The current task does not clearly require that animals infer hidden structure or stimulus value through abstract reasoning; rather, the generalized stimulus may simply be treated as similar to the conditioned cue. The terminology should therefore be reconsidered or softened.

      (8) I also found the use of the term "valence" somewhat problematic. The manuscript appears to use valence to refer to graded responding across tones with different aversive significance, but valence typically refers more broadly to distinctions between appetitive and aversive value. Here, terms such as "threat value" or "aversive value" may be more precise. The authors should consider revising this language throughout.

    3. Reviewer #2 (Public review):

      Summary:

      The following points are those that occurred to me across readings of the paper. They are listed in what I take to be the order of their significance. Many of the points relate to the loose use of language and invocation of concepts that are not warranted, given the study design and results obtained.

      Major Comments:

      (1) The concept of ensemble turnover is interesting - the way it is introduced and discussed implies some type of spontaneous change in the neural underpinnings of fear discrimination and generalization in the PL. But, of course, every trial involves an opportunity to learn about the threat CS or the generalization test stimuli, and I am troubled by the thought that stability in the neural underpinnings of fear discrimination and generalization will actually reflect the level of defensive behaviours evoked on different trial types and/or the discrepancy between those behaviours and the outcome of a given trial in the generalization test. That is, stability in the neural underpinnings may be related to an animal's certainty or uncertainty in the contingency between a stimulus and danger; or, put another way, an animal's confidence that danger will or won't occur given the presence of some stimulus. This is not uninteresting. It is, however, not considered anywhere in the paper, which is overloaded with references to inferred threat values and integration of information across different types of stimuli. The protocol is not one that requires inference about anything or integration across anything.

      (2) I appreciate the link to Gu and Johansen in paragraph 3 of the Introduction, but the type of generalization under investigation here is not the same as the type of 'generalization' studied by Gu and Johansen [who used a sensory preconditioning protocol]. Nonetheless, the authors have forced the language used by Gu and Johansen into their paper, and this has created tension [at least for this reader] as the concepts introduced by Gu and Johansen [inference, integration] are simply not relevant given the generalization protocol used here. Here are a few examples of points where the tension might interfere with a reader's understanding:

      a. 'We hypothesized that generalization to novel stimuli depends on stable subnetwork organization that enables comparisons between learned and inferred valence, as well as population-level features that reduce variability across related representations.'

      I understand the words in the hypothesis, but can't form a representation of what is being said because of the reference to terms that stand in need of clarification [inferred valence, variability across related representations], but, ultimately, won't be clarified. This needs to be re-expressed so that the reader can appreciate what is being said.

      b. 'Our results show that stable cortical subnetworks integrate the emotional "gist" of memory and inferred valence for novel cues over time, despite ongoing ensemble reorganization, and that population-level firing rate similarity across stimulus presentations determines threat generalization.'

      Again, what does this mean? How is the gist of a memory integrated with inferred valence for novel cues over time? The statement simply doesn't make sense. This needs to be rewritten for clarity.

      c. 'In CS⁺15 mice, positively modulated sound-responsive neurons exhibited graded tone activity reflecting the contingency learned valence as well as the inferred valence of novel tones across testing days...'.

      Can this be rewritten as 'In CS⁺15 mice, positively modulated sound-responsive neurons exhibited graded activity to the tone CS and its variants that were used to assess generalization.'? The overloading of the text with references to 'contingency learned valence' and 'inferred valence' is unnecessary and makes it much harder to understand what has been shown in the results.

      (3) Re the same passage of text as in 2c:

      Is it the case that these neurons are simply tracking the expression of freezing to the various tones? The same question applies to the results obtained for the CS+3 mice. If this is the case, then why should the results be taken to support the banner statement that 'Sound-modulated PL population responses encode learned and inferred valence' - these analyses do not support that statement. And, as indicated, I don't believe that the language of learned and inferred valence is appropriate to such statements, given the nature of the protocol used and results obtained. It is a study looking at how populations of neurons in the PL respond during presentations of auditory stimuli that were subject to discriminative conditioning, and during tests of generalized freezing to other [intermediate] auditory stimuli.

      (4) It is stated that:

      'In no-shock controls, although both positive and negative responses were present, population activity was not modulated by tone frequency or valence'.

      What does this mean? I can understand that population activity was not modulated by tone frequency. But what does it mean to say that it was not modulated by valence? Why should it have been when none of the tones were conditioned in this group and, hence, mice were responding to all the tones equally? And given that this is true, I don't understand the use of 'valence' here, or the subsequent statements in this paragraph that 'graded responses require associative learning' and that 'PL population responses encode graded sound-valence associations that reflect both learning and inference, closely matching behavioral generalization.' The latter statement is particularly unwarranted and, again, highlights a major issue with the paper. It could and should be rewritten as 'PL population responses reflect behavioral generalization.' There is nothing in the additional language that adds to the reader's understanding of what has been shown. The reference to 'graded sound-valence associations that reflect both learning and inference' is completely unwarranted, given the nature of this study. It is anathema to the vast literature on stimulus generalization. If the authors wished to make statements of this sort, they should have taken a different approach, perhaps using protocols like those featured in Gu and Johansen.

      (5) The section titled, 'Consistently active neurons preserve valence representations as newly recruited neurons sharpen remote memory traces' ends with the following summary:

      'Together, these results indicate that consistently active neurons maintain stable representations of learned and inferred sound associations across time, whereas neurons recruited after conditioning progressively acquire graded tuning at later retrieval stages. This dynamic refinement suggests that cortical memory representations become increasingly selective during systems consolidation, while a stable neuronal subpopulation preserves the core emotional content of the memory.'

      Once again, the summary is not in keeping with the results obtained. The 'dynamic refinement' of representations is far more likely to reflect the repeated testing across days 1, 15, and 30 rather than anything to do with systems consolidation - at the very least, it is the simplest interpretation of the results. The impact of repeated testing is evident in the sharpening of generalization gradients over time, which is contrary to what is otherwise observed in the literature - the incredibly well-documented broadening of generalization gradients with time. Given this impact of repeated testing, surely the changes in the neuronal population that underlie performance are more likely to reflect the learning that occurs on days 1, 15, and 30, which is reflected in reduced freezing to the non-conditioned tones. If this is a reasonable take on the results, then I don't see the basis for invoking systems consolidation at all, and I don't see the basis for inferring a stable neuronal subpopulation that preserves the emotional content of the memory. Rather, non-reinforced presentations of 'never-reinforced' tones result in recruitment of additional neurons that result in suppression of freezing responses to those stimuli.

      (6) In the section titled, 'Population vector similarity at stimulus onset determines degree of generalization', it is stated that:

      'Because population similarity peaked shortly after stimulus onset, we quantified similarity during the first 5 s after tone onset relative to the CS⁺. In CS⁺15 mice, population similarity was highest for 15/15 and 15/11 tone pairs with no differences between them.'

      Isn't this consistent with the view that the population response in the PL simply reflects the level of freezing? Freezing to the 15-15 and 15-11 tones is most likely to be similar on their first presentation prior to the effects of extinction on the 11 kHz tone; hence the results obtained. That is, these results appear to clearly indicate that neuronal responses in the PL reflect the degree of stimulus generalization, as evidenced in freezing behavior. Given all that we know about the involvement of the PL in expressing fear responses, it is not appropriate to claim that 'population vector similarity at stimulus onset *determines* the degree of generalization'. The PL responses simply reflect the varying levels of performance displayed to the different types of tones. What have I missed that could be taken to support additional statements?

      Later in the same section, it is stated that 'population-level similarity at stimulus onset scales with behavioral threat generalization and is maximal for tones associated with robust threat responses.' For simplicity and, therefore, clarity, this should be rewritten as 'population-level similarity at stimulus onset reflects behavioral threat generalization.'

      (7) In the section titled, 'Different subnetworks encode acoustic versus learned properties of sound association', it is stated that:

      'Our previous analyses show that learned and inferred associations are represented at the population level. However, these results do not resolve whether graded responses arise from pooled activity of frequency-selective neurons or from subnetworks encoding integrated learned valence across tones.'

      What does it mean to say 'integrated learned valence across tones'? As it presently stands, the meaning of the phrase is unclear. It only makes sense if one supposes that generalized freezing responses to the 11 and 7 kHz tones reflect separate associations between those tones and the aversive foot shock US. This supposition is inconsistent with the rich literature on generalization of Pavlovian conditioned fear responses. Specifically, it is inconsistent with the many theories of fear generalization, which attribute the reduction in fear as one moves away from the specific conditioned stimulus to a decrement in the ability of the test stimulus to activate the trained CS-US association. My strong impression is that the authors would do well to ground their findings in theories of stimulus/fear generalization, of which there are many. This would better serve the results obtained [and the reader's appreciation of them] - at present, the unnecessary invocation of concepts does very little to enhance the reader's appreciation or understanding of what has been found in the study.

      (8) Another example of what has been a common theme in this review:

      '...we hypothesized that the PL active ensemble segregates into functionally distinct subnetworks: one encoding tone-specific sensory features with dynamic characteristics, and another responding to all frequencies encoding stable core memory content and inferred emotional valence.'

      What does it mean to say 'all frequencies encoding stable core memory content and inferred emotional valence'? Do the authors mean to say '...and another that tracks freezing/defensive responses regardless of whether they were elicited by the trained CS or one of the generalization test stimuli'?

      (9) It is stated that - 'Graded clusters encode emotional valence but constitute only a fraction of the active population; yet valence coding at the population level remains accurate and precise. This indicates that neurons newly recruited into the population - likely frequency-selective and organized within learning-independent clusters - can be shaped by associative processes through modulation of firing activity.'

      What does this mean? Are the authors trying to say that - 'Some clusters of PL neurons track freezing responses. In spite of the fact that these are only a fraction of the total active neuronal population, the population-level response of PL neurons also tracks the levels of fear to the trained tone and its variants used in the test for generalization'? If this is what one wants to say, then the final statement in the reproduced section does not follow. That is, there is no indication that 'neurons newly recruited into the population - likely frequency-selective and organized within learning-independent clusters - can be shaped by associative processes through modulation of firing activity.' As noted, the characteristics of other ensembles that become active across the repeated tests on days 1, 15, and 30 are more likely to reflect learning from non-reinforcement that occurs within and across those sessions. Perhaps this is what is meant by the phrase, 'shaped by associative processes'? If so, it should be stated explicitly instead of left to the reader to work out.

      (10) The following points all relate to the Discussion and reiterate many of the points above.

      a. 'A subset of neurons remains consistently active across sessions, preserving core components of the memory trace and supporting inference of emotional valence for novel sounds, while neurons recruited after conditioning progressively acquire valence selectivity at remote time points.'

      'Inference of emotional valence' is unclear and unwarranted for all of the reasons provided above regarding the use of language.

      b. '...Our data reconcile these views by demonstrating that cortical representations of emotional valence emerge rapidly after learning and persist within stable subnetworks, even as the broader population undergoes substantial turnover. This architecture preserves core mnemonic content while allowing flexibility in the surrounding ensemble.'

      These statements assume that the PL neuronal responses reflect something more than the levels of freezing behavior to the different stimuli; what are the grounds for this assumption?

      c. 'Importantly, these subnetworks encode both learned contingencies and the inferred valence of novel stimuli along a graded representational axis, suggesting that strong recurrent connectivity provides a stable scaffold for emotional memory representations.'

      What is a graded representational axis, and what part of the first statement suggests that 'strong recurrent connectivity provides a stable scaffold for emotional memory representations'? If the authors' goal was to make statements about emotional memory representations vis-à-vis emotional memory content, they should have used protocols that allowed them to probe such content. The auditory fear conditioning protocol used here [followed by tests for generalization to other auditory stimuli that differ in frequency from the conditioned tone] is not one that lends itself to analysis of emotional memory representations or content.

      d. 'Dynamic tone-selective responsive neurons emerge independently of learning, as they are present in both control and experimental mice, reflecting pre-existing PL sensory-driven properties (Hockley & Malmierca, 2024; Zikopoulos & Barbas, 2006).'

      Maybe. They are also likely to have developed as a consequence of the repeated testing on days 1, 15, and 30, which involved intermixed exposures to the tones of different frequencies. That is, rather than 'pre-existing PL sensory-driven properties', the responses of these neurons might reflect the emergence of discrimination between the various tones across testing, and greater suppression of freezing to the non-trained tones compared to the trained tone across the various test intervals.

    4. Reviewer #3 (Public review):

      Summary:

      Normandin et al. explore the coding of stimuli predicting an aversive event in the prelimbic cortex. Stimuli could either be explicitly paired, explicitly unpaired, or novel but with an inferred association with the aversive event (generalization). Long-term tracking of GCaMP-positive neurons allowed them to examine how coding evolves out to a month following training. In general, they found two types of ensemble codes. One was ensembles coding for each stimulus independently, but with enhanced responding to the one eliciting a freezing response. The other was ensembles that responded to all stimuli in proportion to their similarity to the stimulus paired with the aversive event, either increasing or decreasing their activation with the degree of freezing elicited by a stimulus. Importantly, this second set of ensembles was more stable across days, potentially providing a memory trace.

      Strengths:

      (1) The authors track ensembles in prelimbic cortex over long time scales, providing valuable information on the consolidation of neural codes.

      (2) Neural coding of generalization is examined, which is under-examined in the field.

      Weaknesses:

      (1) Difficult to determine if responses treated as encoding stimulus valence are driven instead by the behavior that the stimulus elicits, freezing.

      (2) The study implies that the identified ensembles are causally related to valence memory, but no experimental interventions are performed to justify this.

    5. Author response:

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      The authors combine discriminative auditory fear conditioning with longitudinal in vivo calcium imaging to ask how prelimbic (PL) representations of learned and generalized threat evolve across recent and remote memory time points. Using two different CS+ frequencies and a no-shock control group, they report that PL population activity tracks graded behavioral generalization, that population similarity is highest for tones eliciting strong threat responding, and that distinct subnetworks can be identified that appear to encode tone-specific sensory features versus learned threat-related response structure.

      To my knowledge, this may be the first study to comprehensively examine neural encoding of fear generalization in prelimbic cortex (PL). The manuscript is ambitious and technically interesting, and several aspects are potentially important. In particular, the suggestion that neurons showing graded, learning-related response patterns become selectively stabilized over time is intriguing. The inclusion of two CS+ training conditions and a no-shock control also strengthens the case that at least some of the reported effects are related to associative learning rather than simple sensory differences. However, in its current form, the manuscript does not yet fully support the strength of the conceptual claims. Several issues limit confidence in the interpretation, including the possibility that repeated testing itself contributes to changes across days, uncertainty about the relationship between neural activity and freezing behavior, limited quantitative documentation of longitudinal cell registration, and a number of problems in figure clarity and statistical framing. Overall, the study contains promising observations, but the claims should be narrowed, and several analyses or controls would be needed to fully support the proposed framework.

      Detailed Comments

      (1) A general concern is that the repeated test procedure itself may contribute to extinction. Because the animals are exposed to multiple CS frequencies across multiple test days, and each tone is presented three times per session, some of the reported changes in behavior and neural activity across days could reflect extinction or repeated nonreinforced retrieval rather than the passage of time per se. This is especially relevant given that the manuscript makes claims about recent versus remote representations and representational drift over 30 days. At a minimum, the authors should discuss this limitation explicitly and temper claims about time-dependent changes. Ideally, they would include a control group in which animals are tested only once or twice (e.g., at an early and later time point with fewer CS frequencies), or a reduced-frequency testing design that minimizes extinction while still allowing evaluation of recent versus remote memory.

      We agree with the reviewer that repeated testing is an inherent limitation of longitudinal memory studies and may itself contribute to some neural changes across sessions. However, several aspects of our behavioral design and results argue against extinction or repeated nonreinforced retrieval as the primary drivers of the observed effects. Importantly, discrimination ratios remained stable or increased across time rather than progressively diminishing as would be expected under extinction (this new analysis will be added to the resubmission). Nevertheless, we will address this important point in the Discussion and explicitly acknowledge that repeated retrieval may contribute to some component of the observed representational changes.

      (2) More generally, some of the reported learning-related neural differences may be driven by behavioral differences, particularly freezing, rather than by learning or generalization per se. For example, animals that freeze more to certain frequencies may show corresponding neural response differences simply because freezing alters PL activity. The authors should examine this possibility more directly. Analyses testing whether recorded cells encode freezing behavior, or whether tone frequency-related neural differences remain robust when comparing high- and low-freezing epochs, would help determine whether the reported effects reflect learned stimulus value rather than behavioral state differences.

      We thank the reviewer for raising this important point, which was also noted by the other reviewers. To address this issue, we will implement Reviewer 3’s suggested Generalized Linear Model (GLM) analysis using inferred spiking activity derived from the Ca2+ signals, with both tone identity and freezing behavior included as predictors. Because freezing behavior varies across trials whereas stimulus identity is fixed, this approach will allow us to dissociate their respective contributions to neuronal activity. If, after accounting for freezing behavior, responsive neurons continue to exhibit graded coding consistent with inferred threat value, this would strengthen the interpretation that the identified ensembles reflect generalization gradients related to aversive value rather than freezing behavior alone. Otherwise, we will adjust the conclusions according to the interpretation that freezing itself drives the generalization gradients.
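
      As an illustration only, the kind of dissociation described above can be sketched with a small Poisson GLM in which inferred spike counts are regressed on both a tone predictor and trial-by-trial freezing. Everything in this sketch is an assumption made for the example and is not taken from the manuscript: the Poisson family with a log link, the coding of tone as the number of frequency steps from the CS⁺, the noise-free toy data, and the function names `solve` and `fit_poisson_glm`.

```python
import math

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_poisson_glm(X, y, n_iter=50, tol=1e-10):
    """Fit log(mu) = X . beta by Newton's method.

    Assumes the first column of X is an intercept (all ones).
    """
    p = len(X[0])
    beta = [0.0] * p
    beta[0] = math.log(sum(y) / len(y))  # start the intercept at log(mean count)
    for _ in range(n_iter):
        mu = [math.exp(sum(b * v for b, v in zip(beta, row))) for row in X]
        # Score vector X^T (y - mu) and information matrix X^T diag(mu) X.
        grad = [sum((yi - mi) * row[j] for row, yi, mi in zip(X, y, mu))
                for j in range(p)]
        info = [[sum(mi * row[j] * row[k] for row, mi in zip(X, mu))
                 for k in range(p)] for j in range(p)]
        step = solve(info, grad)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    return beta

# Hypothetical trials: tone = frequency steps away from the CS+,
# freeze = fraction of the tone period spent freezing (varies within a tone).
tone = [0, 0, 1, 1, 2, 2, 3, 3]
freeze = [0.9, 0.6, 0.7, 0.3, 0.5, 0.2, 0.1, 0.4]
X = [[1.0, t, f] for t, f in zip(tone, freeze)]

# Noise-free expected counts generated from known coefficients.
true_beta = [1.0, -0.3, 0.5]
y = [math.exp(sum(b * v for b, v in zip(true_beta, row))) for row in X]

beta = fit_poisson_glm(X, y)
print("intercept, tone, freezing:", [round(b, 4) for b in beta])
```

      In this toy setting, a tone coefficient that stays non-zero with freezing in the model corresponds to graded tone coding that freezing alone does not explain. The two predictors can only be separated because freezing varies across trials of the same tone.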

      (3) A central feature of the manuscript is the analysis of neural response properties over an extended period of time, up to 30 days after learning. However, aside from a brief mention in the Methods that spatial registration was used, the manuscript provides very little quantitative information about this critical aspect of the study. The paper would be strengthened by including explicit metrics describing longitudinal cell tracking, such as the number and proportion of ROIs retained across all sessions, distributions of spatial-footprint correlations or centroid distances across days, and representative examples of matched imaging fields over time. Without this information, it is difficult to assess how strongly the longitudinal claims are supported.

      We thank the reviewer for this suggestion. We will include measures of registration quality in the resubmission.
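
      For concreteness, two of the registration metrics requested above (centroid distance and spatial-footprint correlation between sessions) could be computed along the following lines. This is a minimal sketch under stated assumptions: footprints are small 2D arrays of non-negative weights, distances are in pixels, and the function names and toy footprints are hypothetical rather than drawn from the authors' pipeline.

```python
import math

def centroid(footprint):
    """Intensity-weighted centroid (row, col) of a 2D spatial footprint."""
    total = sum(sum(row) for row in footprint)
    r = sum(i * v for i, row in enumerate(footprint) for v in row) / total
    c = sum(j * v for row in footprint for j, v in enumerate(row)) / total
    return r, c

def centroid_distance(fp_a, fp_b):
    """Euclidean distance (in pixels) between two footprint centroids."""
    (ra, ca), (rb, cb) = centroid(fp_a), centroid(fp_b)
    return math.hypot(ra - rb, ca - cb)

def footprint_correlation(fp_a, fp_b):
    """Pearson correlation between two flattened footprints of equal shape."""
    a = [v for row in fp_a for v in row]
    b = [v for row in fp_b for v in row]
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

# Hypothetical footprints of one ROI on two imaging days;
# the second is the first shifted by one pixel to the right.
day_a = [[0, 0, 0, 0],
         [0, 1, 1, 0],
         [0, 1, 1, 0],
         [0, 0, 0, 0]]
day_b = [[0, 0, 0, 0],
         [0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 0, 0]]

print("centroid distance:", centroid_distance(day_a, day_b))
print("footprint correlation:", footprint_correlation(day_a, day_b))
```

      Reporting the distributions of these two quantities for matched versus non-matched ROI pairs is one common way to show how well cells were tracked across days.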

      (4) The text states that "Figs. 1c and 1d show GCaMP6f expression in PL, representative calcium footprints, and activity traces". However, the figure as presented does not clearly show all of these elements, at least not in a way that matches the description in the Results. The correspondence between text and figure should be corrected.

      We will correct the correspondence between the text and the figure.

      (5) The labeling of Figure 2a is insufficient for interpretation. The legend states that the panel shows raster plots of sound responsiveness, but the axes and scaling are not clearly defined. It is not clear from the figure what the x-axis represents, whether the y-axis corresponds to individual neurons, where the CS period occurs, or what the activity scale at the right denotes. Also, the term 'rasters' implies that spikes were analyzed. It seems that the spike inference approach (CASCADE) was only used for later analyses. Perhaps 'heat-plot' would be more accurate here? Generally, this figure should be annotated more clearly so that the reader can understand it without referring back to the Methods.

      Thank you for this suggestion. We will clarify the labelling of Figure 2a and refer to the graphs as “activity-plots”.

      (6) In relation to Figure 3, the analysis of population-averaged responses across tone frequencies is useful, but the manuscript would be stronger with additional statistical analyses across time and across groups. For example, if the authors want to argue that learning induces graded changes in neural responses and that these evolve across time, they should directly compare within-group responses across days and also compare matched frequencies between the conditioned groups and the no-shock controls. These analyses would help establish whether the observed differences are genuinely learning dependent and whether they change significantly over time.

      We will redo the statistics for Figure 3 to take into account the following variables: group (CS15, CS3, no shock), frequency (3, 7, 11, 15 kHz), and day of testing (2, 15, 30).

      (7) The inclusion of two different CS+ frequencies and a no-shock control is a strength of the study and substantially improves the interpretation that graded neural responses are related to learning and generalization rather than to simple sensory processing or passage of time. That said, I am not entirely comfortable with the use of the term "inference" throughout the manuscript. What is being measured here appears closer to sensory generalization than inference in a stronger cognitive sense. The current task does not clearly require that animals infer hidden structure or stimulus value through abstract reasoning; rather, the generalized stimulus may simply be treated as similar to the conditioned cue. The terminology should therefore be reconsidered or softened.

      We thank the reviewer for appreciating the strengths of the experimental design and for this thoughtful suggestion regarding terminology. We agree that the term “inference” may overstate the cognitive processes engaged by the current task. Accordingly, we will revise the terminology throughout the manuscript to describe these effects as graded generalization of threat value across stimuli.

      (8) I also found the use of the term "valence" somewhat problematic. The manuscript appears to use valence to refer to graded responding across tones with different aversive significance, but valence typically refers more broadly to distinctions between appetitive and aversive value. Here, terms such as "threat value" or "aversive value" may be more precise. The authors should consider revising this language throughout.

      We will correct the language and use “threat value”.

      Reviewer #2 (Public review):

      Summary:

      The following points are those that occurred to me across readings of the paper. They are listed in what I take to be the order of their significance. Many of the points relate to the loose use of language and invocation of concepts that are not warranted, given the study design and results obtained.

      Major Comments:

      (1) The concept of ensemble turnover is interesting - the way it is introduced and discussed implies some type of spontaneous change in the neural underpinnings of fear discrimination and generalization in the PL. But, of course, every trial involves an opportunity to learn about the threat CS or the generalization test stimuli, and I am troubled by the thought that stability in the neural underpinnings of fear discrimination and generalization will actually reflect the level of defensive behaviours evoked on different trial types and/or the discrepancy between those behaviours and the outcome of a given trial in the generalization test. That is, stability in the neural underpinnings may be related to an animal's certainty or uncertainty in the contingency between a stimulus and danger; or, put another way, an animal's confidence that danger will or won't occur given the presence of some stimulus. This is not uninteresting. It is, however, not considered anywhere in the paper, which is overloaded with references to inferred threat values and integration of information across different types of stimuli. The protocol is not one that requires inference about anything or integration across anything.

      We thank the reviewer for these important points, which we address in further detail below.

      Ongoing learning during test sessions: The reviewer correctly notes that unreinforced test presentations may constitute extinction-learning trials and that some neural changes across days could therefore reflect ongoing learning rather than spontaneous ensemble reorganization. However, new analyses indicate that extinction is unlikely to be the primary driver of our findings. Discrimination ratios do not decay over time; instead, they either sharpen or remain stable across sessions (new analyses to be included in the resubmission). These results argue against robust extinction as the primary source of the neural changes observed across sessions. This interpretation is also consistent with the strength of our conditioning protocol, which used 10 CS+ shock pairings and 10 CS− no-shock pairings specifically to minimize extinction across repeated testing sessions. Nevertheless, we acknowledge that the current design cannot fully dissociate time-dependent consolidation from retrieval-induced plasticity, and we will explicitly discuss this limitation in the revised Discussion.

      Stability reflecting behavioral consistency: We agree that this alternative cannot be fully excluded. However, the cluster stability analyses assess identity at the level of response profile across all four frequencies, not response magnitude alone. Tone-selective clusters, which also show consistent behavioral correlates (firing rate correlates with threat value, Fig. S8), do not show equivalent profile stability, suggesting that the stability of graded clusters is not simply a consequence of behavioral consistency. This point will be added to the Discussion in the resubmission.

      Language of "inference" and "integration": The reviewer is correct that responses to novel tones are consistent with graded stimulus generalization. We will substantially revise the manuscript to replace "inference" and "integration" with more precise language describing graded frequency generalization gradients.

      (2) I appreciate the link to Gu and Johansen in paragraph 3 of the Introduction, but the type of generalization under investigation here is not the same as the type of 'generalization' studied by Gu and Johansen [who used a sensory preconditioning protocol]. Nonetheless, the authors have forced the language used by Gu and Johansen into their paper, and this has created tension [at least for this reader] as the concepts introduced by Gu and Johansen [inference, integration] are simply not relevant given the generalization protocol used here. Here are a few examples of points where the tension might interfere with a reader's understanding:

      We thank the reviewer for these specific and constructive criticisms. We will revise the manuscript throughout to remove or redefine terms like "inferred valence" and "integration," replacing them with clearer, more accurate descriptions of graded generalization of threat value. Below we address each point raised by the reviewer regarding terminology clarifications.

      (a) 'We hypothesized that generalization to novel stimuli depends on stable subnetwork organization that enables comparisons between learned and inferred valence, as well as population-level features that reduce variability across related representations.'

      I understand the words in the hypothesis, but can't form a representation of what is being said because of the reference to terms that stand in need of clarification [inferred valence, variability across related representations], but, ultimately, won't be clarified. This needs to be re-expressed so that the reader can appreciate what is being said.

      The hypothesis will be rewritten as: "We hypothesized that generalization to tones acoustically similar to the CS+ and CS− depends on the emergence of stable ensembles encoding threat value, and that population-level response similarity across stimuli would correlate with the degree of behavioral fear generalization, consistent with prior work in auditory cortex [1]."

      (b) 'Our results show that stable cortical subnetworks integrate the emotional "gist" of memory and inferred valence for novel cues over time, despite ongoing ensemble reorganization, and that population-level firing rate similarity across stimulus presentations determines threat generalization.'

      Again, what does this mean? How is the gist of a memory integrated with inferred valence for novel cues over time? The statement simply doesn't make sense. This needs to be rewritten for clarity.

      The summary statement will be rewritten: "Our results show that stable cortical sub-ensembles preserve the emotional content of the fear memory over time, despite ongoing ensemble reorganization, and that population-level firing rate similarity in response to tones associated with threat correlates with the degree of behavioral threat generalization."

      (c) 'In CS⁺15 mice, positively modulated sound-responsive neurons exhibited graded tone activity reflecting the contingency learned valence as well as the inferred valence of novel tones across testing days...'.

      Can this be rewritten as 'In CS⁺15 mice, positively modulated sound-responsive neurons exhibited graded activity to the tone CS and its variants that were used to assess generalization.'? The overloading of the text with references to 'contingency learned valence' and 'inferred valence' is unnecessary and makes it much harder to understand what has been shown in the results.

      We will adopt the reviewer's suggested rewording: "In CS+15 mice, positively modulated sound-responsive neurons exhibited graded activity to the tone CS and its variants that were used to assess generalization."

      We will systematically review the entire manuscript to ensure consistency with this revised framing.

      (3) Re the same passage of text as in 2c:

      Is it the case that these neurons are simply tracking the expression of freezing to the various tones? The same question applies to the results obtained for the CS+3 mice. If this is the case, then why should the results be taken to support the banner statement that 'Sound-modulated PL population responses encode learned and inferred valence' - these analyses do not support that statement. And, as indicated, I don't believe that the language of learned and inferred valence is appropriate to such statements, given the nature of the protocol used and results obtained. It is a study looking at how populations of neurons in the PL respond during presentations of auditory stimuli that were subject to discriminative conditioning, and during tests of generalized freezing to other [intermediate] auditory stimuli.

      The reviewer is correct that the graded population responses observed in PL could reflect freezing behavior across tone frequencies rather than encoding an abstract threat-value representation. This important concern was also raised by other reviewers. To address it directly, we will follow Reviewer 3’s suggestion and implement a Generalized Linear Model (GLM) using inferred spiking activity derived from the Ca2+ signals, with both tone identity and freezing behavior included as predictors. This analysis will allow us to dissociate the respective contributions of tone frequency and freezing to the graded neural responses. Based on the outcome of this analysis, we will revise and appropriately adjust our conclusions.

      In addition, we will revise the section heading and surrounding text to remove the terminology of “learned and inferred valence.” Instead, the findings will be described more conservatively as: “PL population responses reflect behavioral generalization to auditory stimuli following discriminative fear conditioning.”

      (4) It is stated that:

      'In no-shock controls, although both positive and negative responses were present, population activity was not modulated by tone frequency or valence'.

      What does this mean? I can understand that population activity was not modulated by tone frequency. But what does it mean to say that it was not modulated by valence? Why should it have been when none of the tones were conditioned in this group and, hence, mice were responding to all the tones equally? And given that this is true, I don't understand the use of 'valence' here, or the subsequent statements in this paragraph that 'graded responses require associative learning' and that 'PL population responses encode graded sound-valence associations that reflect both learning and inference, closely matching behavioral generalization.' The latter statement is particularly unwarranted and, again, highlights a major issue with the paper. It could and should be rewritten as 'PL population responses reflect behavioral generalization.' There is nothing in the additional language that adds to the reader's understanding of what has been shown. The reference to 'graded sound-valence associations that reflect both learning and inference' is completely unwarranted, given the nature of this study. It is anathema to the vast literature on stimulus generalization. If the authors wished to make statements of this sort, they should have taken a different approach, perhaps using protocols like those featured in Gu and Johansen.

      The reviewer is correct that controls do not form threat associations; however, these animals still could respond differentially to distinct frequencies, something that is not reflected in the data. We will correct the section indicating that distinct neutral frequencies do not produce graded responses: "graded responses require associative learning" will be retained but reframed simply as: "graded frequency-dependent population responses were absent in animals that did not receive fear conditioning." The concluding statement of the paragraph will be rewritten as: "PL population responses reflect behavioral generalization to acoustically similar stimuli following discriminative conditioning," in line with the reviewer's suggestion.

      (5) The section titled, 'Consistently active neurons preserve valence representations as newly recruited neurons sharpen remote memory traces' ends with the following summary:

      'Together, these results indicate that consistently active neurons maintain stable representations of learned and inferred sound associations across time, whereas neurons recruited after conditioning progressively acquire graded tuning at later retrieval stages. This dynamic refinement suggests that cortical memory representations become increasingly selective during systems consolidation, while a stable neuronal subpopulation preserves the core emotional content of the memory.'

      Once again, the summary is not in keeping with the results obtained. The 'dynamic refinement' of representations is far more likely to reflect the repeated testing across days 1, 15, and 30 rather than anything to do with systems consolidation - at the very least, it is the simplest interpretation of the results. The impact of repeated testing is evident in the sharpening of generalization gradients over time, which is contrary to what is otherwise observed in the literature - the incredibly well-documented broadening of generalization gradients with time. Given this impact of repeated testing, surely the changes in the neuronal population that underlie performance are more likely to reflect the learning that occurs on days 1, 15, and 30, which is reflected in reduced freezing to the non-conditioned tones. If this is a reasonable take on the results, then I don't see the basis for invoking systems consolidation at all, and I don't see the basis for inferring a stable neuronal subpopulation that preserves the emotional content of the memory. Rather, non-reinforced presentations of 'never-reinforced' tones result in recruitment of additional neurons that result in suppression of freezing responses to those stimuli.

      We respectfully disagree with the reviewer’s interpretation. While repeated testing cannot be entirely excluded as a contributing factor, several lines of evidence suggest that it cannot fully account for our observations.

      Regarding extinction: discrimination ratios between CS+ and all other frequencies either remained stable or increased over time (new analysis included in resubmission), indicating that animals continued to discriminate threat value across the testing period rather than showing the progressive suppression expected under extinction — the opposite of what we observe.

      Regarding the recruitment of new neurons: repeated non-reinforced tone exposure would be expected to produce stimulus-specific adaptation — characterized by reduced, less discriminative neural responsiveness and flatter tuning profiles [2]— not the progressive sharpening we observe. The same would be expected if these neurons represent or are associated with new extinction learning.

      Finally, sharpening of generalization gradients during repeated within-subjects testing has been reported previously [3], suggesting that successive exposures may promote more precise discrimination in some cases. Consistent with this, discrimination learning has also been shown to narrow or sharpen fear generalization gradients rather than broaden them [4], supporting the idea that discriminative conditioning enhances stimulus specificity during testing. Although we cannot exclude the possibility that more extended training could eventually broaden the generalization gradient, under the training parameters and temporal window used in our study, the data support a progressive sharpening of the gradient over time. In the revised Discussion, we will present systems consolidation as the primary interpretive framework and further elaborate on why repeated testing is unlikely to account for the full pattern of behavioral and neural findings reported here.

      (6) In the section titled, 'Population vector similarity at stimulus onset determines degree of generalization', it is stated that:

      'Because population similarity peaked shortly after stimulus onset, we quantified similarity during the first 5 s after tone onset relative to the CS⁺. In CS⁺15 mice, population similarity was highest for 15/15 and 15/11 tone pairs with no differences between them.'

      Isn't this consistent with the view that the population response in the PL simply reflects the level of freezing? Freezing to the 15-15 and 15-11 tones is most likely to be similar on their first presentation prior to the effects of extinction on the 11 Hz tone; hence the results obtained. That is, these results appear to clearly indicate that neuronal responses in the PL reflect the degree of stimulus generalization, as evidenced in freezing behavior. Given all that we know about the involvement of the PL in expressing fear responses, it is not appropriate to claim that 'population vector similarity at stimulus onset *determines* the degree of generalization'. The PL responses simply reflect the varying levels of performance displayed to the different types of tones. What have I missed that could be taken to support additional statements?

      The GLM analysis described in our response to reviewers 1 and 3 will directly address the contribution of freezing. We will report these results in the resubmission and revise the interpretive language in the manuscript accordingly.

      However, regarding the analysis of population vector similarity, we need to clarify a point of confusion. The reviewer states “Freezing to the 15-15 and 15-11 tones is most likely to be similar on their first presentation prior to the effects of extinction on the 11 Hz tone; hence the results obtained”. The similarity vectors were calculated by correlating activity across all tone presentations within each testing day, not only the first two presentations. In Fig. 4, “Early” and “Late” refer to the order of a tone within a trial, which we will clarify more explicitly in the resubmission. Notably, repeated-measures analyses did not reveal any effect of the time variable (Fig. 4e,f), indicating that similarity across tone presentations remained high for tones associated with high threat value. Importantly, our data showed no evidence that responses to 11 kHz or 15 kHz in the CS15 group, or to 3 kHz in the CS3 group, exhibited extinction-like patterns at either the behavioral or neural level. Therefore, the persistence of high population similarity across time provides additional evidence against extinction as the primary explanation for our findings.
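
      For illustration only, the similarity measure described above could be sketched as follows. This is not the analysis code from the manuscript: the function names, the array layout (neurons by samples per presentation), the use of Pearson correlation, the averaging over all presentation pairs, and the synthetic data are all assumptions made for the example. Only the 5 s post-onset window, the comparison against the CS+, and the use of all tone presentations come from the text.

```python
import numpy as np

def population_vector(traces, onset_idx, fs, window_s=5.0):
    """Mean activity of each neuron in the window after tone onset.

    traces: array of shape (n_neurons, n_samples) for one presentation.
    """
    stop = onset_idx + int(round(window_s * fs))
    return traces[:, onset_idx:stop].mean(axis=1)

def similarity_to_reference(trials_by_tone, reference_tone, onset_idx, fs):
    """Mean Pearson correlation between each tone's population vectors and
    those of the reference tone, over all pairs of presentations."""
    vectors = {
        tone: [population_vector(t, onset_idx, fs) for t in trials]
        for tone, trials in trials_by_tone.items()
    }
    ref = vectors[reference_tone]
    out = {}
    for tone, vecs in vectors.items():
        rs = []
        for i, v in enumerate(vecs):
            for j, r in enumerate(ref):
                if tone == reference_tone and i == j:
                    continue  # skip a presentation compared with itself
                rs.append(np.corrcoef(v, r)[0, 1])
        out[tone] = float(np.mean(rs))
    return out

# Synthetic data (hypothetical): each tone evokes a mixture of a "CS+-like"
# pattern and an unrelated pattern, plus noise. Tone labels are in kHz.
rng = np.random.default_rng(0)
n_neurons, n_samples, fs, onset_idx = 50, 100, 10.0, 20
p_ref = rng.normal(size=n_neurons)
p_other = rng.normal(size=n_neurons)
mix = {15: 1.0, 11: 0.8, 7: 0.3, 3: 0.0}  # weight of the CS+-like pattern
trials_by_tone = {}
for tone, w in mix.items():
    pattern = w * p_ref + (1.0 - w) * p_other
    trials_by_tone[tone] = [
        pattern[:, None] + 0.5 * rng.normal(size=(n_neurons, n_samples))
        for _ in range(4)
    ]

sim = similarity_to_reference(trials_by_tone, 15, onset_idx, fs)
for tone in sorted(sim, reverse=True):
    print(f"15/{tone}: r = {sim[tone]:.2f}")
```

      Because every presentation within a session enters the average, a measure of this kind does not depend on the first presentation of each tone alone.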

      We will remove the word "determines" from the manuscript, as our data cannot conclusively establish a causal relationship.

      Later in the same section, it is stated that 'population-level similarity at stimulus onset scales with behavioral threat generalization and is maximal for tones associated with robust threat responses.' For simplicity and, therefore, clarity, this should be rewritten as 'population-level similarity at stimulus onset reflects behavioral threat generalization.'

      We will make this correction.

      (7) In the section titled, 'Different subnetworks encode acoustic versus learned properties of sound association', it is stated that:

      'Our previous analyses show that learned and inferred associations are represented at the population level. However, these results do not resolve whether graded responses arise from pooled activity of frequency-selective neurons or from subnetworks encoding integrated learned valence across tones.'

      What does it mean to say 'integrated learned valence across tones'? As it presently stands, the meaning of the phrase is unclear. It only makes sense if one supposes that generalized freezing responses to the 11 and 7 kHz tones reflect separate associations between those tones and the aversive foot shock US. This supposition is inconsistent with the rich literature on generalization of Pavlovian conditioned fear responses. Specifically, it is inconsistent with the many theories of fear generalization, which attribute the reduction in fear as one moves away from the specific conditioned stimulus to a decrement in the ability of the test stimulus to activate the trained CS-US association. My strong impression is that the authors would do well to ground their findings in theories of stimulus/fear generalization, of which there are many. This would better serve the results obtained [and the reader's appreciation of them] - at present, the unnecessary invocation of concepts does very little to enhance the reader's appreciation or understanding of what has been found in the study.

      We thank the reviewer for raising this point. The phrase "integrated learned valence across tones" refers specifically to a subpopulation of neurons that respond to all four frequencies in a graded manner, with response magnitude scaling according to threat value. This is distinct from tone-selective neurons, which respond preferentially to a single frequency. The neurons responding to all tones in a graded manner are present only in conditioned animals and not in no-shock controls, demonstrating that their graded response profile is shaped by associative learning.

      We agree, however, that the phrase "integrated learned valence" is unnecessarily opaque and we will replace it with more precise language: these neurons will be described as showing graded frequency-dependent responses whose magnitude scales with threat value. We believe this subpopulation represents a genuinely novel finding that complements the behavioral generalization literature by identifying a specific neural substrate for the generalization gradient within PL.

      (8) Another example of what has been a common theme in this review:

      '...we hypothesized that the PL active ensemble segregates into functionally distinct subnetworks: one encoding tone-specific sensory features with dynamic characteristics, and another responding to all frequencies encoding stable core memory content and inferred emotional valence.'

      What does it mean to say 'all frequencies encoding stable core memory content and inferred emotional valence'? Do the authors mean to say '...and another that tracks freezing/defensive responses regardless of whether they were elicited by the trained CS or one of the generalization test stimuli'?

      As stated in our previous responses, in the resubmission we will determine the contribution of freezing. If we find that freezing predicts graded neural responses, we will adjust the language of the manuscript.

      (9) It is stated that - 'Graded clusters encode emotional valence but constitute only a fraction of the active population; yet valence coding at the population level remains accurate and precise. This indicates that neurons newly recruited into the population, likely frequency-selective and organized within learning-independent clusters, can be shaped by associative processes through modulation of firing activity.'

      What does this mean? Are the authors trying to say that - 'Some clusters of PL neurons track freezing responses. In spite of the fact that these are only a fraction of the total active neuronal population, the population-level response of PL neurons also tracks the levels of fear to the trained tone and its variants used in the test for generalization.' If this is what one wants to say, then the final statement in the reproduced section does not follow. That is, there is no indication that 'neurons newly recruited into the population, likely frequency-selective and organized within learning-independent clusters, can be shaped by associative processes through modulation of firing activity.' As noted, the characteristics of other ensembles that become active across the repeated tests on days 1, 15, and 30 are more likely to reflect learning from non-reinforcement that occurs within and across those sessions. Perhaps this is what is meant by the phrase, 'shaped by associative processes'? If so, it should be stated explicitly instead of left to the reader to work out.

      We thank the reviewer for highlighting the lack of clarity in this passage and agree that the original phrasing was insufficiently precise. What we intended to convey is that only a subset of PL neurons displays graded tuning that tracks behavioral generalization across tones. Nevertheless, despite constituting only a fraction of the total active population, this graded coding is also reflected at the population level. Therefore, we suggest that neurons recruited into the active population after conditioning — likely frequency-selective neurons — contribute to the graded population responses through changes in their firing-rate activity, which is modulated by threat value (Fig. S8). We will rewrite this passage in the resubmission to make this interpretation explicit rather than leaving it to the reader to infer.

      Regarding the reviewer's suggestion that the characteristics of newly recruited neurons more likely reflect learning from non-reinforced exposures during repeated test sessions, we respectfully maintain that this interpretation is difficult to reconcile with two aspects of our data. First, graded-response neurons are absent in no-shock controls that are exposed to nonreinforced repeated testing. Second, as detailed in our responses to previous points, the progressive sharpening of population responses over time is inconsistent with what would be expected from repeated non-reinforced exposure, which would more plausibly produce broader or flatter tuning profiles.

      We agree that the phrase "shaped by associative processes" was ambiguous and will replace it with explicit language clarifying that we refer to fear conditioning as the associative process driving the emergence of graded responses, rather than any learning occurring during the test sessions themselves.

      (10) The following points all relate to the Discussion and reiterate many of the points above. 

      (a) 'A subset of neurons remains consistently active across sessions, preserving core components of the memory trace and supporting inference of emotional valence for novel sounds, while neurons recruited after conditioning progressively acquire valence selectivity at remote time points.'

      'Inference of emotional valence' is unclear and unwarranted for all of the reasons provided above regarding the use of language.

      We will modify the language as stated in the prior points.

      (b) '...Our data reconcile these views by demonstrating that cortical representations of emotional valence emerge rapidly after learning and persist within stable subnetworks, even as the broader population undergoes substantial turnover. This architecture preserves core mnemonic content while allowing flexibility in the surrounding ensemble.'

      These statements assume that the PL neuronal responses reflect something more than the levels of freezing behavior to the different stimuli; what are the grounds for this assumption?

      We will incorporate new analysis (GLM) to better address this point and conclusions.

      (c) 'Importantly, these subnetworks encode both learned contingencies and the inferred valence of novel stimuli along a graded representational axis, suggesting that strong recurrent connectivity provides a stable scaffold for emotional memory representations.'

      What is a graded representational axis, and what part of the first statement suggests that 'strong recurrent connectivity provides a stable scaffold for emotional memory representations'? If the authors' goal was to make statements about emotional memory representations vis-à-vis emotional memory content, they should have used protocols that allowed them to probe such content. The auditory fear conditioning protocol used here [followed by tests for generalization to other auditory stimuli that differ in frequency from the conditioned tone] is not one that lends itself to analysis of emotional memory representations or content.

      We thank the reviewer for this comment and agree that both phrases require clarification or revision.

      By "graded representational axis" we intended to convey that PL population activity varies systematically as a function of stimulus similarity to the conditioned tone — that is, population responses are not categorical but scale continuously with spectral proximity to the CS+. We agree this was not clearly stated and will revise the manuscript accordingly.

      Regarding recurrent connectivity, we agree with the reviewer that nothing in our data directly measures or manipulates connectivity between neurons. This statement was intended as a speculative interpretive hypothesis in the Discussion, motivated by the established literature linking strong recurrent connectivity in prefrontal circuits to stable population-level representations [5]. However, we acknowledge that invoking it in this context, without direct evidence, risks overstating our conclusions. We will revise this sentence to make its speculative nature explicit and ground it more carefully in the cited literature rather than presenting it as an inference from our own data.

      In summary, we will ensure that our conclusions are restricted to population-level coding of learned threat value and its generalization across auditory frequencies. We will revise the relevant passages in the Discussion so that speculative interpretations regarding emotional memory content are either removed or clearly flagged as speculative hypotheses.

      (d) 'Dynamic tone-selective responsive neurons emerge independently of learning, as they are present in both control and experimental mice, reflecting pre-existing PL sensory-driven properties (Hockley & Malmierca, 2024; Zikopoulos & Barbas, 2006).'

      Maybe. They are also likely to have developed as a consequence of the repeated testing on days 1, 15, and 30, which involved intermixed exposures to the tones of different frequencies. That is, rather than 'pre-existing PL sensory-driven properties', the responses of these neurons might reflect the emergence of discrimination between the various tones across testing, and greater suppression of freezing to the non-trained tones compared to the trained tone across the various test intervals.

      We thank the reviewer for this point. Our interpretation that these neurons reflect pre-existing PL sensory-driven properties was based on the observation that tone-selective responses were present in control animals that never received conditioning, consistent with prior reports of sensory responsiveness in PL cortex [6, 7]. Because these responses emerge from the first time we expose mice to the intermediate frequencies, they cannot be explained by repeated exposure. Moreover, we did not observe progressive refinement, emergence of discrimination-like changes, or suppression of responding to non-reinforced tones in control mice. This difference between conditioned and control animals indicates that repeated tone exposure alone is not sufficient to produce the observed dynamics; associative learning is necessary. We therefore maintain that the tone-selective responses of these neurons reflect pre-existing sensory-driven properties of PL cortex that are present independently of conditioning history.

      In summary, we thank the reviewer for suggesting clarifications to our interpretation, for raising the possibility that freezing behavior may contribute to graded neural responses, and for raising the question of whether repeated tone exposure may contribute to the properties of neurons recruited after conditioning. In the revised manuscript, we will include additional analyses to better dissociate the contributions of freezing behavior and tone identity, clarify passages that were insufficiently precise, and include a paragraph in the Discussion addressing potential alternative explanations alongside our own interpretation of the data.

      Reviewer #3 (Public review):

      Summary:

      Normandin et al. explore the coding of stimuli predicting an aversive event in the prelimbic cortex. Stimuli could either be explicitly paired, explicitly unpaired, or novel but with an inferred association with the aversive event (generalization). Long-term tracking of GCaMP-positive neurons allowed them to examine how coding evolves out to a month following training. In general, they found two types of ensemble codes. One was ensembles coding for each stimulus independently, but with enhanced responding to the one eliciting a freezing response. The other was ensembles that responded to all stimuli in proportion to their similarity to the stimulus paired with the aversive event, either increasing or decreasing their activation with the degree of freezing elicited by a stimulus. Importantly, this second set of ensembles was more stable across days, potentially providing a memory trace.

      Strengths:

      (1) The authors track ensembles in prelimbic cortex over long time scales, providing valuable information on the consolidation of neural codes.

      (2) Neural coding of generalization is examined, which is under-examined in the field.

      We thank the reviewer for appreciating our design to track ensembles over time and the relevance of studying the neural substrates of generalization.

      Weaknesses:

      (1) Difficult to determine if responses treated as encoding stimulus valence are driven instead by the behavior that the stimulus elicits, freezing.

      We thank the reviewer for this thoughtful and constructive comment. We agree that an alternative interpretation is that the graded-response ensembles may partially reflect freezing-related activity rather than mnemonic or salience-related representations of the conditioned stimuli themselves. In the revision, we will acknowledge that prior work has identified PL neurons that encode freezing independently of stimulus identity or associative content. Furthermore, we will implement the reviewer’s suggested generalized linear model (GLM) approach using inferred spiking activity derived from the Ca2+ signals. Specifically, we will include both stimulus identity and freezing behavior as predictors. Because freezing varies across trials whereas stimulus presentation is fixed, this analysis will allow us to dissociate the relative contributions of stimulus-related versus freezing-related activity to the graded neuronal responses. We thank the reviewer for this excellent suggestion.

      If graded stimulus coding remains significant after accounting for freezing behavior, this would strengthen the interpretation that these ensembles encode learned salience or associative properties of the stimuli rather than behavioral output alone. Conversely, if freezing explains a substantial proportion of the variance, we will revise our interpretation accordingly.
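
      For illustration only, a model of the kind described above could be sketched as follows. This is not the authors' analysis: the response specifies only a GLM on inferred spiking activity with stimulus identity and freezing as predictors, so the Poisson family with a log link, the per-trial freezing fraction, the deviance comparison between nested models, the function names, and the synthetic neuron are all assumptions made for the example.

```python
import numpy as np

def design_matrix(tone_ids, freezing, tones):
    """Columns: intercept, one-hot tone identity (first tone is the
    reference level), and per-trial freezing fraction."""
    tone_ids = np.asarray(tone_ids)
    onehot = np.column_stack([(tone_ids == t).astype(float) for t in tones[1:]])
    return np.column_stack([np.ones(len(tone_ids)), onehot,
                            np.asarray(freezing, dtype=float)])

def fit_poisson_glm(X, y, n_iter=50, tol=1e-8):
    """Poisson GLM with log link, fitted by iteratively reweighted least squares."""
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(y.mean() + 1e-9)        # start at the mean rate
    for _ in range(n_iter):
        eta = np.clip(X @ beta, -20.0, 20.0)  # guard against overflow in exp
        mu = np.exp(eta)
        z = eta + (y - mu) / mu               # working response
        XtW = X.T * mu                        # working weights equal mu
        beta_new = np.linalg.solve(XtW @ X, XtW @ z)
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return beta

def deviance(X, y, beta):
    """Poisson deviance, using the convention 0 * log(0) = 0."""
    mu = np.exp(X @ beta)
    safe_y = np.where(y > 0, y, 1.0)
    return 2.0 * np.sum(np.where(y > 0, y * np.log(safe_y / mu), 0.0) - (y - mu))

# Synthetic neuron (hypothetical): spike counts depend on tone identity only,
# while freezing is correlated with tone but varies from trial to trial.
rng = np.random.default_rng(1)
tones = [3, 7, 11, 15]                                   # kHz
tone_gain = {3: 0.0, 7: 0.3, 11: 0.8, 15: 1.2}
mean_freezing = {3: 0.1, 7: 0.3, 11: 0.6, 15: 0.8}
tone_ids = np.repeat(tones, 60)
freezing = np.clip([mean_freezing[int(t)] + 0.2 * rng.normal() for t in tone_ids],
                   0.0, 1.0)
counts = rng.poisson(np.exp(0.5 + np.array([tone_gain[int(t)] for t in tone_ids])))

X_full = design_matrix(tone_ids, freezing, tones)
X_no_tone = X_full[:, [0, -1]]        # intercept + freezing
X_no_freezing = X_full[:, :-1]        # intercept + tone identity

dev_full = deviance(X_full, counts, fit_poisson_glm(X_full, counts))
lr_tone = deviance(X_no_tone, counts, fit_poisson_glm(X_no_tone, counts)) - dev_full
lr_freezing = deviance(X_no_freezing, counts,
                       fit_poisson_glm(X_no_freezing, counts)) - dev_full
print(f"deviance explained uniquely by tone: {lr_tone:.1f}")
print(f"deviance explained uniquely by freezing: {lr_freezing:.1f}")
```

      In this sketch, the increase in deviance when a predictor is dropped estimates its unique contribution. The comparison is informative only because freezing varies across trials within a tone; if freezing were identical on every presentation of a tone, the two predictors could not be separated.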

      (2) The study implies that the identified ensembles are causally related to valence memory, but no experimental interventions are performed to justify this.

      We appreciate the reviewer's point. We agree that our data are correlational in nature and that establishing a causal relationship between identified ensembles and valence memory would require experimental interventions such as holographic two-photon manipulations, which are beyond the scope of the present study but represent an important direction for future work.

      To provide an indirect link between ensemble organization and behavior within the constraints of the current dataset, we will examine inter-individual variability in the revised manuscript. Specifically, we will test whether the proportion of neurons participating in stable graded-response ensembles versus dynamic stimulus-specific ensembles predicts individual differences in freezing behavior and fear generalization across retrieval sessions. If animals with a higher proportion of stable graded-response neurons show stronger discrimination and less generalization to non-conditioned tones, this would strengthen the association between ensemble organization and behavioral outcome, while remaining correlational in interpretation.

      We will modify the manuscript terminology accordingly, replacing causal language with phrasing that accurately reflects the associative nature of our conclusions.

      References

      (1) Aschauer, D.F., et al., Learning-induced biases in the ongoing dynamics of sensory representations predict stimulus generalization. Cell Rep, 2022. 38(6): p. 110340.

      (2) Kato, H.K., S.N. Gillet, and J.S. Isaacson, Flexible Sensory Representations in Auditory Cortex Driven by Behavioral Relevance. Neuron, 2015. 88(5): p. 1027–1039.

      (3) Vervliet, B., et al., Generalization gradients in human predictive learning: Effects of discrimination training and within-subjects testing. Learning and Motivation, 2011. 42(3): p. 210–220.

      (4) Dunsmoor, J.E. and K.S. LaBar, Effects of discrimination training on fear generalization gradients and perceptual classification in humans. Behav Neurosci, 2013. 127(3): p. 350–6.

      (5) Mante, V., et al., Context-dependent computation by recurrent dynamics in prefrontal cortex. Nature, 2013. 503(7474): p. 78–84.

      (6) Hockley, A. and M.S. Malmierca, Auditory processing control by the medial prefrontal cortex: A review of the rodent functional organisation. Hear Res, 2024. 443: p. 108954.

      (7) Zikopoulos, B. and H. Barbas, Prefrontal projections to the thalamic reticular nucleus form a unique circuit for attentional mechanisms. J Neurosci, 2006. 26(28): p. 7348–61.

    1. eLife Assessment

      This short report is an important study showing that visual acuity declines nonlinearly with cone dropout, while eye motion partially compensates by improving sampling from remaining cones. The method for experimentally simulating cone dropout is compelling, leveraging state-of-the-art imaging and testing in human subjects. Inclusion of additional analysis on absolute cone density and eye motion would further strengthen the study.

    2. Reviewer #1 (Public review):

      The authors demonstrate an innovative approach to investigate the effect of cone dropout on visual acuity using their newly developed olo system. By systematically reducing the coverage of real-world input to the cone photoreceptor mosaic ("cone dropout condition"), the authors are able to assess how having fewer cones leads to reduced vision, in comparison to existing approaches ("pixel dropout condition").

      The capture of a rich dataset, including cone imaging and eye motion, is valuable. Benchmarking with the prior literature, suggesting that good visual acuity can be maintained despite a 50% loss in cone density, is impressive. However, it is known that cone density varies dramatically from the peak cone density location in the foveal center to even a location a few degrees outside of the fovea. In addition, there is a high degree of subject-to-subject variation in peak cone density. Given that the C stimulus is hollow in the middle, the stimulus does not actually hit the location of the peak cone density but must land slightly outside of it. Therefore, considering the actual cone density of where the stimulus lands will be important to discuss and/or analyze.

      The observation of visual acuity maintenance with cone dropout has been a longstanding mystery since the 2013/2018 papers by Ratnam and Foote. The authors should be commended for their approach to addressing this important question. However, there are some simplifications and assumptions being applied to make this jump (i.e., that a 50% reduction in cone stimulation in a healthy eye is comparable to a 50% reduction in cone density in a patient). It seems unlikely that, in a patient's eye, with cone dropout, there will be gaps in the mosaic. Not considering any other non-photoreceptor-related reasons for visual acuity loss, which can occur in patients, the cone aperture acceptance angle may be different due to changes in cone size or packing; the sensitivity of individual cones may also be reduced due to deficits in the visual cycle recovery, which could be affected in disease. Some of these limitations could be addressed and acknowledged more explicitly.

      Overall, this is an impressive study incorporating state-of-the-art technology to probe the fundamental limits of human vision.

    1. eLife Assessment

      Fujita et al. examine the effects of AM-2099, a Nav1.7 inhibitor, on the excitability of human dorsal root ganglion neurons and compare these results to their prior study of Nav1.8 inhibition by suzetrigine. They show that the Nav1.7 inhibitor primarily alters action potential threshold and initiation, but not repetitive firing, whereas Nav1.8 inhibition elicits much stronger inhibition on repetitive firing. These complementary roles of Nav1.7 and Nav1.8 provide a plausible cellular explanation for the limited clinical success of Nav1.7 inhibitors compared to Nav1.8 inhibitors for chronic pain. While the conclusions are important and solid, there are some key shortcomings that should be addressed to strengthen the study.

    2. Reviewer #1 (Public review):

      Summary:

      Fujita and colleagues investigated the effects of two selective peripheral nerve voltage-gated sodium channel inhibitors, targeting either Nav1.7 or Nav1.8, on the excitability of human dorsal root ganglion neurons. The authors discovered that Nav1.8 inhibition is more effective at suppressing repetitive firing of DRG neurons, and this may explain the greater clinical efficacy observed for suzetrigine.

      Strengths:

      The study is interesting, and the findings are conceptually satisfying in that they may explain one aspect of Nav1.7 vs Nav1.8 targeting success.

      Weaknesses:

      (1) The use of postmortem human DRG neurons provides translational relevance, but the use of these cells is also a liability, given their high degree of variability. Of note are the 10 to 20-fold differences in baseline properties among cells, which dwarf the effects of the test compounds. The experiments may suffer from undersampling.

      (2) A potential confounder when using post-mortem human DRG neurons is heterogeneity of cell types. The methods clearly state that the cells selected for recording were of 'generally' small size, but specific criteria for what constitutes 'small' or other unstated selection criteria were not provided. A table of individual cell capacitance and input resistance values, along with information about individual donors (age, sex, ethnicity), is important to include. Additionally, some discussion of how DRG neuron heterogeneity impacts the findings is needed. This relates to concern #1 about sample size determination and how cell heterogeneity factored into this calculation.

    3. Reviewer #2 (Public review):

      Summary:

      The authors examine the functional role of Nav1.7 voltage-gated sodium channels in human sensory neuron electrogenesis using a Nav1.7 selective inhibitor and human dorsal root ganglion neurons obtained from organ donors. Patch-clamp electrophysiology is used at physiological temperature to measure the impact of Nav1.7 inhibition on sensory neurons' action potential firing. This is an important topic as Nav1.7 and Nav1.8 have been identified as therapeutic targets for the treatment of pain, but there has been mixed success with isoform-specific inhibitors in clinical trials. The data suggest that Nav1.7 and Nav1.8 have overlapping yet complementary functions in nociceptor neurons and that targeting both may be most effective for reducing nociception.

      Strengths:

      The data are of high quality. Action potential properties are measured at 37 degrees Celsius. Threshold is measured using brief pulses. The Nav1.7 inhibitor has been reported to be highly selective for Nav1.7 over Nav1.8 and moderately selective for Nav1.7 over Nav1.1 and Nav1.6. Data are collected using identical conditions and protocols to a previous study on the role of Nav1.8 in similar neurons.

      Weaknesses:

      The study relies on a single Nav1.7 inhibitor that has not been extensively characterized. One prior study indicates that the IC50 is around 140 nM, thus the 600 nM concentration used in this study could be predicted to reduce Nav1.7 currents by 80%. However, there is no voltage-clamp data in the current study to confirm this, and therefore, it is unclear if the batch of AM-2099 is as potent as reported in the paper that initially described its selectivity. The impact of Nav1.7 inhibition is compared to data from a previous study by this lab, and this is a minor concern. It would have been interesting to see if the combined inhibition of Nav1.7 and Nav1.8 completely blocked action potential generation in the human DRG neurons.
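
      The 80% figure above can be checked with a simple occupancy calculation. The sketch below assumes a one-site Hill relationship with a Hill coefficient of 1 and no state dependence; these are simplifying assumptions for illustration, not properties established for AM-2099, and the function name is hypothetical.

```python
def fractional_block(conc_nM: float, ic50_nM: float, hill: float = 1.0) -> float:
    """Predicted fraction of current blocked under a simple Hill model.

    Assumes equilibrium binding and a single Hill coefficient (default 1),
    which is an illustrative simplification.
    """
    scaled = (conc_nM / ic50_nM) ** hill
    return scaled / (1.0 + scaled)


# Values quoted in the review: IC50 of about 140 nM, test concentration of 600 nM.
predicted = fractional_block(600.0, 140.0)
print(f"Predicted block: {predicted:.0%}")  # about 81%
```

      Under these assumptions the predicted block is roughly 81%, consistent with the reviewer's estimate; a steeper or shallower Hill coefficient, or state-dependent binding, would shift this value, which is why direct voltage-clamp confirmation is requested.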

    4. Reviewer #3 (Public review):

      Summary:

      In this manuscript, Fujita/Jo/Stewart/Osorno et al. investigate the contribution of Nav1.7 in regulating the excitability and firing properties of human dorsal root ganglion (hDRG) neurons in vitro. The authors characterize the effects of a previously reported Nav1.7-selective blocker AM-2099 in cultured hDRG neurons from postmortem organ donors. The authors observed modest changes in many of the properties expected by inhibiting Nav channels, including decreased action potential upstroke rate and amplitude, while increasing the voltage and current thresholds for spike generation. However, AM-2099 did not change the maximum number of APs in response to suprathreshold stimulation, leading the authors to conclude that Nav1.7 inhibition alone has limited efficacy in reducing the firing properties of hDRG neurons and that Nav1.7 blockers may have limited efficacy as analgesics. This is surprising, given that patients with loss-of-function mutations in Nav1.7 suffer from congenital insensitivity to pain. While it may indeed be true that pharmacological inhibition of Nav1.7 is unlikely to produce analgesia, the present study was limited to a single concentration of AM-2099. The manuscript would be significantly strengthened by a more careful and thorough pharmacological characterization of this compound, which has not been widely used or validated in native human DRG neurons.

      Strengths:

      Experiments are well-designed and executed, and the results presented are convincing. The focus on voltage-gated sodium channels in native human DRG neurons is highly relevant to recent efforts to develop safer analgesic options for chronic pain in people.

      Weaknesses:

      Only a single concentration of AM-2099 was used for all experiments. This compound was reported to be selective for cloned human Nav1.7 channels in heterologous systems, but has not been validated in other studies after the original publication in 2016. Since the original study reported a substantial state-dependent block of recombinant Nav1.7 channels, more detailed pharmacological characterization of AM-2099 is needed in human DRG neurons to fully support these claims. This study would be significantly strengthened by the inclusion of dose-response curves to assess how much of the sodium current is inhibited at this concentration, confirming selectivity in hDRG, and whether maximal inhibition of Nav1.7 still has limited efficacy in reducing the firing of native human sensory neurons.

    1. eLife Assessment

      This valuable study analyses correlations between traits of Chinese frog species and their Red List status, finding differences between adults and larvae and thus pointing to the importance of considering different life-cycle stages in this and possibly other animal groups when assessing species extinction risks. The current study is, however, incomplete because of unclear threat categories for tadpoles, the omission of other key species traits, and insufficient statistical analysis.

    2. Reviewer #1 (Public review):

      The manuscript shows that different traits of adults and larvae correlate with Red List status. The authors argue that this shows a big gap in the conservation of amphibians and that the traits of all life stages should be taken into account in amphibian conservation. Specifically, amphibian conservation should do more for the habitats where the larvae live.

      The manuscript is well written and easy to understand. The methods are sound.

      While the study will make an interesting contribution to conservation science, there are many things that I disagree with.

      I don't think that amphibian larvae and their requirements are a "blind spot" as the title suggests. When reading the manuscript, I didn't learn how conservation practice should change in response to the results.

      I wonder whether the relationship between species traits and extinction risk is of great importance for conservation. If a species is Data Deficient on the IUCN Red List, then species traits could be used to predict its Red List category. However, for other conservation projects, I don't see how this would work. How would traits be linked to captive breeding, conservation translocation, pond construction or habitat management in general? In some cases, I can envision a link between species traits and pond hydroperiod.

      Species traits are body size and morphological traits. That makes sense. However, one of the species traits was microhabitat. I find it far-fetched to call habitat a species trait. This is standard habitat ecology. It is well known that habitats matter and that different habitat types face different threats, and consequently, the species that live in those habitats. Furthermore, habitat and morphology may be confounded. For example, tadpoles in lentic and lotic habitats have very different morphologies. So is it habitat or morphology?

      I don't know how the threat status of Chinese amphibians is determined. IUCN has multiple reasons why a species can be Red Listed. One reason is range size, and another reason is population decline. Personally, I don't think they should be pooled in an analysis because they are fundamentally different reasons why a species has a high extinction risk. A reduction in population size of greater than 30% in 10 years or 3 generations is not the same thing as a small distribution range. Another issue is that IUCN developed the Green Status of species. The Green Status shows that even a species which is LC on the Red List may be significantly depleted.

      The species traits in Table 1 are mostly functional/morphological and body size related (and microhabitat). While there may be correlations between traits and Red List status, it is unknown whether this is correlation or causation. In addition, it is difficult to know the conservation interventions that may be necessary now that we know that relative head width and Red List status are correlated.

      In the discussion, the authors explain why body size and other traits may affect extinction risk and whether there is a causal relationship. I agree that body size may have a direct effect because larger species are harvested more frequently (it was interesting to learn that tadpoles are harvested as well). However, as macroecological studies show, smaller species often have larger populations than larger species. Abundance may matter.

      I found it much harder to understand why relative head length and tympanum size correlated with Red List status. I wasn't convinced by the arguments in the discussion. Tympanum size may be related to hearing and anthropogenic noise. Several studies are cited which show that frogs alter their calling behaviour in response to noise. Crucially, however, they describe changes in behaviour or properties of the advertisement call, yet none show that noise has effects on population viability. If some anthropogenic stressor affects individuals, then this does not mean that it will cause a population decline. When IUCN published the second global amphibian assessment, did they list noise as a major threat to amphibians?

      There are statements that the tadpole stage is the most important stage: "a critical period for amphibian survival" (lines 78-79). While there is high mortality in the tadpole stage, tadpole survival is rather unlikely to affect population survival. Many population models show this. See, for example, Biek et al. 2002 in Conservation Biology. Other papers have argued that the postmetamorphic juvenile stage is most important (Petrovan and Schmidt 2009 Biological Conservation).

      The authors repeatedly make the statement that amphibian conservation should focus more on the tadpole stage. I don't understand why this statement is made. For example, a major activity in amphibian conservation is the restoration and de novo construction of ponds (see Calhoun et al. 2014 PNAS, Moor et al. 2022 PNAS). Ponds are habitats for tadpoles. Others removed fish from amphibian breeding sites because fish prey on tadpoles (and adults; see Vredenburg 2004 PNAS). Semlitsch (2002 in Conservation Biology) argued that the management of pond hydroperiod is a critical element of amphibian recovery plans. Ponds should be temporary because this effectively removes predators that consume tadpoles. Clearly, the tadpole stage is not a neglected stage in amphibian conservation.

    3. Reviewer #2 (Public review):

      Summary:

      In this study, the authors tried to examine whether there are differences in the association between functional traits and extinction risk in adult and tadpole stages in Chinese anurans.

      Strengths:

      Overall, I think the basic idea of the study is interesting and important. It can be applied to other taxa with complex life cycles throughout the animal kingdom.

      Weaknesses:

      I do not think the authors achieve their aims, as the results only partially support their conclusions. The study has several drawbacks that need to be clarified or revised, including the unclear threat categories for tadpoles, model selection and model averaging, the potential problem of AIC, and the omission of other important species traits.

    1. eLife Assessment

      This work provides a fundamental advance through a detailed and integrative analysis of how the tsetse fly feeds on blood, demonstrating that successful penetration depends on subtle structural adaptations rather than extreme forces or unusual anatomy. By combining high-resolution imaging, innovative biomechanical measurements, and experiments on artificial skin, the study offers complementary and compelling evidence, with clear data supporting a robust mechanistic interpretation. These findings have broad significance as they clarify the biomechanics of vector feeding with implications for the transmission of diseases such as African trypanosomiasis across diverse hosts.

    2. Reviewer #1 (Public review):

      Summary:

      This manuscript provides a comprehensive and mechanistic analysis of how tsetse flies feed on blood across a wide range of host skin types. The authors combine detailed anatomical characterization of the feeding apparatus with quantitative measurements of mechanical properties, probing forces, and blood uptake, complemented by experiments using artificial skin. They show that tsetse flies do not rely on extreme forces or uniquely specialized structures, but instead on subtle and highly efficient structural and mechanical adaptations (such as the toothed labellum and coordinated proboscis movements) to achieve effective blood pool feeding. The study successfully moves beyond descriptive anatomy to a quantitative, functional analysis that explains how feeding is accomplished across diverse substrates.

      Strengths:

      A major strength of the work is the impressive integration of multiple complementary approaches. Advanced imaging tools provide a convincing three-dimensional view of the proboscis, labellum, and associated structures, while direct force measurements and blood intake quantification place these observations on a solid quantitative footing. The use of artificial skin with different mechanical properties is particularly powerful, as it allows structure-function relationships to be tested under controlled and reproducible conditions. Together, these datasets provide strong and coherent support for the authors' central conclusions. The quantitative treatment of feeding mechanics represents a significant advance over largely descriptive prior work by others (e.g., Gibson W et al 2017) and establishes a valuable mechanistic insight for studying blood feeding in insect vectors more broadly.

      Weaknesses:

      The study focuses almost entirely on uninfected flies and does not address how infection might alter feeding mechanics or performance. Previous work has shown that trypanosome infection can affect salivary gland function and feeding time (Van Den Abbeele et al 2010), and even cause damage to mouthparts, all of which can influence feeding behavior and efficiency. While this does not detract from the technical quality or the core findings of the study, a more explicit discussion of these biological variables would help place the results in a broader transmission-relevant context and clarify how generalizable the conclusions are to natural infection settings.

      Overall, this is an outstanding and carefully executed study that will have a significant impact on the fields of vector biology and parasite transmission.

    3. Reviewer #2 (Public review):

      Summary:

      This manuscript presents an impressively detailed, multidisciplinary analysis of the mechanics of blood feeding in Glossina spp. Combining SEM, CLSM, µCT, FIB‑SEM, macro‑videography, and quantitative force measurements, the authors characterize the structures and biomechanics of attachment, proboscis deployment, tissue penetration, and blood uptake. They also examine interactions with diverse host‑type substrates, from human skin equivalents to cow, deer, and lizard skin, and integrate these with force measurements to quantify penetration and retraction dynamics.

      The work's key conclusion is that the tsetse fly does not rely on any single exceptional morphological innovation, but rather uses a suite of subtle structural features and retractive forces to feed efficiently across diverse hosts. This result is novel, insightful, and evolutionarily compelling. Overall, this is a strong manuscript that combines methodological sophistication with biological relevance. It should be of high interest to researchers studying vector biology, biomechanics, parasite transmission, and vector-host interactions.

      Strengths:

      (1) The combination of SEM, CLSM, µCT, and FIB‑SEM provides an unusually comprehensive anatomical characterization of the tsetse feeding apparatus.

      (2) The direct measurement of proboscis penetration and retraction forces across diverse substrates is highly original and fills a major knowledge gap in vector-host interaction mechanics.

      (3) The study bridges morphology, mechanics, behavior, and host tissue properties, which strengthens the overall conclusions.

      (4) Imaging of trypanosomes within the hypopharynx and surrounding tissue during feeding provides new information about parasite delivery mechanisms.

      Main Comments:

      (1) The authors conclude that feeding versatility arises from the sum of subtle adaptations. This interpretation is reasonable, but it would help to sharpen which findings most robustly support this statement. For example, the relative similarity of proboscis forces across skin types is compelling evidence that the proboscis is broadly tuned rather than specialized. The observation that tsetse targets softer interscale regions on lizard skin suggests behavioural selectivity, not morphological specialisation. It would strengthen the discussion to highlight which data most directly refute the hypothesis of a unique specialization.

      (2) A central finding is that retraction forces exceed penetration forces across substrates, implying that backward pulling is a key component of wound creation. However, the biological interpretation could be deepened. Specifically, do the authors believe retraction serves primarily to enlarge the pool‑feeding site? How does this compare mechanically to mosquito fascicle oscillation or other blood‑feeding arthropods (especially other flies such as those in the family Tabanidae)? Could retraction forces contribute to anchoring or resisting host grooming behaviors?

      (3) The study analyzes a diverse set of substrates, which is a strength. However, some caveats deserve explicit discussion. Human skin equivalents and dermal equivalents lack the full mechanical complexity of real skin (e.g., innervation, perfusion, tension). Frozen or ethanol‑stored samples, particularly reptile skin, may also exhibit altered mechanical properties compared to live tissues. These limitations do not undermine the findings but should be explicitly acknowledged as they influence the interpretation of absolute force magnitudes.

      (4) The SEM and FIB‑SEM images showing trypanosomes in the hypopharynx and surrounding tissue during penetration are visually striking and suggest rapid dispersal. It would be helpful to connect these observations more clearly to the kinetics of parasite deposition and whether mechanical tissue laceration is likely to increase inoculation efficiency. Without conducting additional experiments, the authors could discuss whether these findings support or modify existing models of salivary-gland-derived parasite release.

      (5) The authors demonstrate that tsetse attachment abilities fall within the range of generalist insects and are far lower than those of obligate ectoparasites. However, the manuscript could discuss how attachment forces relate to the tsetse's ecological context, e.g., whether their attachment is generally brief, whether host shaking strongly selects for grip strength, etc. Is there evidence that other Glossina species or tabanids with different host preferences show variation in attachment performance? This would broaden the relevance of the findings.

      (6) In video 4, could the authors clarify whether the observed maxillary vibrations are hypothesized to reduce penetration resistance or serve another function?

    4. Reviewer #3 (Public review):

      Summary:

      Human and animal trypanosomiasis are fatal illnesses caused by African trypanosomes transmitted by tsetse flies during a bloodmeal. Thus, tsetse fly feeding is the key physical step in disease transmission to mammals. Tsetse fly feeding is not a new story, but it is revisited here through the application of sophisticated imaging techniques and novel biomechanical methods of analysis. The authors aim to provide a high-resolution picture of the structures and forces involved in feeding to provide mechanistic insights into the process of feeding, from attachment, penetration, drinking and retraction of the feeding parts.

      Largely, the authors have achieved their aims. They (i) examine the structures and forces involved in attachment; (ii) provide a detailed multi-image analysis of the proboscis, offering insights into its probing ability and physical mechanism of penetration; (iii) conduct a controlled analysis of the physical forces involved in penetration, reporting that they are in the low mN range, not especially strong but much higher than the mosquito bite; and (iv) provide a first analysis of blood uptake during feeding.

      Strengths:

      The study images the tsetse fly feeding structures in unprecedented detail, with resolution to the µm scale, in 3-D, and during feeding. The resulting images are dramatic and insightful (and beautiful and frightening!), so researchers interested in trypanosomes, tsetse flies, or blood feeding by flies in general will want to see them.

      They conclude that flies attach strongly to smooth surfaces because of interactions possible via the array of acanthae of the pulvillus pad at the ends of the tarsi. The estimated attachment forces are similar in male and female flies, in the low mN range (they look impressively strong in video 1). They provide a very striking analysis of the proboscis and labellum and associated tooth structures (Figures 4 & 5). I recall many years ago observing that tsetse flies are messy feeders, and these structures, especially the rasping teeth structures on the reverse folded labial tips, explain why! This seems more like a chainsaw than a jigsaw in action, but the authors are probably correct that these structures and the probing/retraction mechanism explain many features of tsetse fly feeding and their ability to feed on a wide range of hosts with very different skin types.

      An impressive aspect of this paper is the range of imaging techniques (CLSM, SEM, µCT, FIB-SEM) and the quality of the images, which attests to the obvious care taken with sample preparation. The biomechanical analysis, especially the penetration analysis, is impressive. Finally, the paper is clearly written and presented; it was a very easy read and, overall, a very engaging study.

      Weaknesses:

      I suppose it could be said that the paper is a descriptive study; it doesn't really test a hypothesis, but that is not a prerequisite for sharing it. Perhaps the least convincing parts are the imaging of the flexible versus rigid parts of the structures, which is based on the amount of resilin (flexible) and chitin-protein (stiff), inferred from their autofluorescence. It seems odd that the joints would be less blue (stiffer) in Figure 1i, and it is unclear what the blue structures correspond to in Figure 6B-D.

    1. eLife Assessment

      This useful work addresses a longstanding question of how the extant genetic code came to be selected and conserved almost universally across life. Using a mutational approach and a small set of reporters, the authors demonstrate that the mutational impact was similar for non-standard genetic codes. Considering the limitations of the approach, the data are incomplete in supporting the claim of having provided 'experimental verification of the error minimization theory'.

    2. Reviewer #1 (Public review):

      In this manuscript, the authors investigate the relationship between genetic codes and their robustness to single-point mutations. They construct ten alternative genetic codes by reassigning nine codons to Leu, Ser, or Ala, and assess mutational robustness using three reporter proteins subjected to error-prone PCR. This represents an interesting experimental approach to addressing the hypothesis that the standard genetic code is optimized for mutational robustness.

      Major comment:

      While I find the experimental design valuable, I am not fully convinced by the authors' conclusion that "alterations of the genetic code within the ranges explored in this study have no significant effect on mutational robustness". The current analysis is based on the functional output of three individual reporter proteins. Given that cellular systems involve far more complex interactions, it would be more appropriate to limit this conclusion to mutational robustness at the level of individual protein activity, rather than making broader generalizations.

      Specific comments:

      (1) tRNA modification and expression efficiency (Page 5, line 131).

      The authors attribute the observed inefficiency to the lack of chemical modifications in the tRNAs used. However, gene expression efficiency can also be strongly influenced by DNA sequence design. To better support this claim, it would be helpful to compare luciferase activity when expressed using native E. coli tRNAs. This comparison could clarify whether the observed effects are due to tRNA modification status or other sequence-dependent factors.

      (2) Discrepancy between expression level and activity (Figure S7 vs Figure S8).

      Although GAL expression levels appear similar across different genetic codes (Figure S7), their activities differ substantially (Figure S8), even in the low-mutation library. This discrepancy warrants further investigation. Possible explanations include differences in protein folding efficiency or translational error rates, as mentioned by the authors in the main text.

      To address this, the authors could analyze the protein products using mass spectrometry. If this is not feasible due to low expression levels, alternative approaches such as SDS-PAGE (e.g., with radiolabeling or Western blotting) would still provide valuable information. Additionally, comparing activity after in vitro refolding could help distinguish between folding defects and sequence-level errors. While I understand that the primary aim of this study is to compare mutational robustness across genetic codes, discussing these observations would significantly enhance the mechanistic insight of the work.

      (3) Protein expression analysis for additional reporters.

      Since protein expression levels are critical for interpreting reporter activity, similar analyses should also be performed for luciferase (Luc) and mSG in both high- and low-mutation libraries. This would ensure that differences in activity are not confounded by variations in protein abundance.

    3. Reviewer #2 (Public review):

      Summary:

      The study addresses a long-standing question in molecular biology and genetics: why has nature selected the current genetic code (SGC, or standard genetic code)? The authors have tested 'error minimization theory', one of the prevailing hypotheses to explain this. Their approach is to create a minimum genetic code (MGC) and its variants (3^9 theoretically possible codes). Using three parameters to quantify the effect of mutations (polarity, volume, and hydropathy), they computationally test the cost of these 3^9 genetic codes by simulation. Finally, they test this cost experimentally using an in vitro translation system with 10 selected genetic code variants spanning a range of costs (low to high). They use three randomly mutated reporter genes for this purpose: beta-galactosidase, luciferase, and mSG. They find no correlation between the cost of the genetic code and the reporters' output. Based on these observations, they suggest that error-minimization theory may not explain the current genetic code.
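
      To make the scale of the 3^9 code space and the notion of a replacement cost concrete, a minimal sketch follows. It is not the authors' method: the nine codon labels are hypothetical placeholders (the actual vacant codons are not listed in this review), only hydropathy is scored (Kyte-Doolittle values), and the cost is taken as the mean squared hydropathy difference between codons that are one nucleotide apart.

```python
from itertools import product

# Kyte-Doolittle hydropathy values for the three reassigned amino acids.
HYDROPATHY = {"Ala": 1.8, "Ser": -0.8, "Leu": 3.8}

# Hypothetical placeholder codons, chosen only so that neighbours exist;
# they are NOT the codons reassigned in the study.
VACANT_CODONS = ["AAA", "AAC", "AAG", "ACA", "ACC", "ACG", "AGA", "AGC", "AGG"]


def one_mutation_apart(a: str, b: str) -> bool:
    """True if two codons differ at exactly one nucleotide position."""
    return sum(x != y for x, y in zip(a, b)) == 1


# Index pairs of placeholder codons connected by a single point mutation.
NEIGHBOR_PAIRS = [
    (i, j)
    for i in range(len(VACANT_CODONS))
    for j in range(i + 1, len(VACANT_CODONS))
    if one_mutation_apart(VACANT_CODONS[i], VACANT_CODONS[j])
]


def replacement_cost(assignment: tuple) -> float:
    """Mean squared hydropathy change over all single-mutation neighbour pairs."""
    diffs = [
        (HYDROPATHY[assignment[i]] - HYDROPATHY[assignment[j]]) ** 2
        for i, j in NEIGHBOR_PAIRS
    ]
    return sum(diffs) / len(diffs)


# Every way of assigning Ala, Ser, or Leu to the nine codons: 3**9 codes.
all_codes = list(product(HYDROPATHY, repeat=len(VACANT_CODONS)))
costs = [replacement_cost(code) for code in all_codes]

print(len(all_codes))          # 19683
print(min(costs), max(costs))  # range of costs under this toy definition
```

      In this toy definition, a code that assigns the same amino acid to every codon has zero cost; as Major Concern (2) notes, a purely physicochemical score of this kind ignores the structural and functional context of each substitution.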

      The question they are asking is very exciting, and their approach is solid. The authors are very careful in their analyses and conclusions.

      Major Concerns:

      (1) The rationale for using MGC instead of SGC: It is unclear why the authors rely on the MGC for this analysis when the central question concerns the SGC. If the goal is to evaluate whether the SGC minimizes mutational cost, a more direct approach would be to generate alternative variants of the SGC itself and compare their mutational cost distributions. At present, it is difficult to assess whether conclusions drawn from this comparison are fully relevant to the stated biological question.

      (2) The mutational cost analysis appears biologically oversimplified because all amino acid substitutions are treated equivalently. The analysis assumes that all mutations contribute equally to fitness consequences, which does not reflect biological reality. In natural proteins, the impact of an amino acid substitution depends strongly on its structural and functional context. For example, substitutions affecting catalytic residues, ligand-binding interfaces, phosphorylation sites, or other regulatory motifs can severely impair protein function even when associated changes in polarity, hydropathy, or volume are minimal. Conversely, substitutions in structurally permissive or functionally dispensable regions may have little or no measurable effect despite larger physicochemical differences. Therefore, changes in polarity, hydropathy, and volume alone do not necessarily predict functional consequences.

      (3) It is not clear why they increased the concentration of the two tRNAs in near-SGC. Have they maintained the same tRNA concentrations in experiments explained in Fig 5 for all 10 genetic codes tested?

    4. Reviewer #3 (Public review):

      Summary:

      In this manuscript, Miyachi and Ichihashi investigate whether the arrangement of the genetic code affects mutational robustness. Using an in vitro minimal genetic code with vacant codons, they constructed 10 non-standard genetic codes by reassigning Ala, Ser, and Leu, generating codes with replacement costs that were generally higher than those of the standard genetic code across several amino acid property measures. They then tested how random mutations affected the activity of reporter proteins translated under these altered codes. Although error minimization theory predicts that higher-cost codes should make mutations more harmful, the authors report that protein function declined to a similar extent across all codes examined, suggesting that mutational robustness remains largely unchanged within the range of genetic code alterations tested here.

      Strengths:

      This is an interesting study that investigates one of the most fundamental and intriguing questions in molecular evolution: the emergence of the genetic code, which is nearly universal across nature. The in vitro approach is a powerful aspect of the work and provides an opportunity to examine this phenomenon experimentally at a depth that has previously been inaccessible.

      Weaknesses:

      However, the authors' use of random mutation libraries has certain limitations that prevent the study from realizing its full potential to uncover the mechanisms governing the molecular evolution of the genetic code.

      Major points:

      (1) Statistical analyses are missing for several of the manuscript's main claims. This issue applies throughout the paper, including, but not limited to, Figures 1D, 2B, 4B-D, and 5B.

      (2) In Figure 2A, the authors modify the NanoLuc gene by reassigning Ala, Leu, or Ser to new codons and elegantly show that the in vitro availability of the corresponding tRNAs is important for protein function. However, the functional importance of the specific modified positions within NanoLuc is not clear. As a result, it is difficult to determine what the expected consequences of these codon changes should be, which in turn limits the interpretation of the observed changes in protein activity. To improve the interpretability of this experiment, the authors should report exactly how many codons were modified in each variant and, ideally, examine the effect of progressively increasing the number of reassigned codons.

      (3) The calculations presented in Figure 3 raise an interesting conceptual question: why does the near-standard genetic code not exhibit the lowest cost? One possible explanation is that the standard genetic code evolved under multiple competing constraints and is therefore not expected to be optimal for any single cost metric, while still achieving strong overall performance. In this context, it would be informative if the authors combined the three cost measures into a single integrated index and examined whether the near-SGC performs more favorably when all three dimensions are considered together. Such an analysis could add important depth to the study.

      (4) It is difficult to assess the consequences of the random mutations presented in Figure 4 on reporter gene function based solely on the reported "error rate/base" parameter. In particular, the x-axis in Figure 4B should be converted into the estimated number of mutations per gene. This would make the results more intuitive and would allow the reader to better evaluate the expected degree of disruption to protein function.

      (5) A central limitation of the random mutagenesis libraries used in Figure 5, which also underlie one of the manuscript's main claims, is that the exact mutations and their distribution across the reporter genes are not reported. In addition, protein activity is measured only at the level of the entire library, without directly linking individual mutations to their functional consequences. This substantially limits mechanistic interpretation. In my view, this issue can only be addressed convincingly if the authors test a set of defined variants carrying specific mutations and directly evaluate their functional effects.

      (6) Related to the previous point, in Figures 5C, 5E, and 5G, the authors present the ratio between low-mutation-rate and high-mutation-rate libraries. However, because each library contains a different collection of mutations, it is unclear what can be inferred from these comparisons. To overcome this limitation, the authors should assess the effects of altered genetic codes on specific, defined mutations rather than on heterogeneous mutation pools alone.

      (7) Along the same lines, in Figures 5C, 5E, and 5G, it is unclear why the effects of random mutations would be expected to correlate with the three calculated cost metrics, given that the positions, identities, and functional relevance of the mutations within the genes are not known. Without this information, the biological meaning of these correlations remains difficult to evaluate.

      (8) For each mutagenesis library, the number of variants, the average number of mutations per variant, and the distribution of mutation positions should be reported clearly and transparently. These details are important for evaluating the strength of the conclusions.

      (9) Because only three amino acids were manipulated in the non-standard genetic codes, it remains unclear whether these particular amino acids occupy positions in the reporter proteins that are especially important for function and therefore likely to generate strong phenotypic effects. More broadly, it is not clear whether the assay is sufficiently sensitive to detect the effects of only a subset of deleterious variants within a pooled library. This point should be addressed more explicitly.
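
      As a rough illustration of the conversion requested in point (4), the sketch below turns a per-base error rate into an expected number of mutations per gene. The Poisson approximation for the fraction of unmutated copies, and the example error rate and gene length, are assumptions for illustration and are not values taken from the manuscript.

```python
import math


def expected_mutations_per_gene(error_rate_per_base, gene_length_bp):
    """Expected mutation count: per-base error rate times gene length."""
    return error_rate_per_base * gene_length_bp


def fraction_unmutated(error_rate_per_base, gene_length_bp):
    """Poisson approximation: probability that a copy carries zero mutations."""
    mean = expected_mutations_per_gene(error_rate_per_base, gene_length_bp)
    return math.exp(-mean)


# Hypothetical inputs: 0.2% errors per base on a 1,000 bp gene.
rate, length = 0.002, 1000
mean_mutations = expected_mutations_per_gene(rate, length)
unmutated = fraction_unmutated(rate, length)
print(f"{mean_mutations:.1f} mutations per gene, {unmutated:.1%} of copies unmutated")
```

      Reporting the axis in these units would let readers judge directly how many substitutions a typical library member is expected to carry.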

    1. eLife Assessment

      This important study fills a major geographic and temporal gap in understanding Paleocene mammal evolution in Asia and proposes an intriguing "brawn before bite" hypothesis grounded in diverse analytical approaches. The work rests on a solid methodological base. Some limitations remain, including uncertainty introduced by pooling different tooth positions, limited dietary interpretation, and the predominantly herbivorous taxonomic focus, which narrows the ecological scope of the conclusions. However, the manuscript provides a substantially strengthened and well-supported contribution, while appropriately inviting further work to clarify dietary trends, broader ecological context, and links between dental trait evolution and environmental change.

    2. Reviewer #2 (Public review):

      Summary:

      This study uses dental traits of a large sample of Chinese mammals to track evolutionary patterns through the Paleocene. It presents and argues for a 'brawn before bite' hypothesis -- mammals increased in body size disparity before evolving more specialized or adapted dentitions. The study makes use of an impressive array of analyses, including dental topographic, finite element, and integration analyses, which help to provide a unique insight into mammalian evolutionary patterns.

      Strengths:

      This paper helps to fill in a major gap in our knowledge of Paleocene mammal patterns in Asia, which is especially important because of the diversification of placentals at that time. The total sample of teeth is impressive and required considerable effort for scanning and analyzing. And there is a wealth of results for DTA, FEA, and integration analyses. Further, some of the results are especially interesting, such as the novel 'brawn before bite' hypothesis and the possible link between shifts in dental traits and arid environments in the Late Paleocene. Overall, I enjoyed reading the paper and I think the results will be of interest to a broad audience.

      Weaknesses:

      For the original draft of the manuscript, I had four major concerns with the study, especially related to the sampling, diet, and evidence for the 'brawn before bite' hypothesis. I still believe that the original issues that I raised may be weaknesses of the study. For example, there is still limited discussion on diets (even though the dental topographic analyses used in the study are designed for inferring diets). And I find the results a little challenging to interpret because teeth of multiple positions are included in the same samples, which seems problematic. That said, the authors have addressed each of my previous concerns and have made major revisions, including running new analyses, and thus I support the paper.

    3. Author response:

      The following is the authors’ response to the original reviews

      eLife Assessment

      This important study fills a major geographic and temporal gap in understanding Paleocene mammal evolution in Asia and proposes an intriguing "brawn before bite" hypothesis grounded in diverse analytical approaches. However, the findings are incomplete because limitations in sampling design - such as the use of worn or damaged teeth, the pooling of different tooth positions, and the lack of independence among teeth from the same individuals - introduce uncertainties that weaken support for the reported disparity patterns. The taxonomic focus on predominantly herbivorous clades also narrows the ecological scope of the results. Clarifying methodological choices, expanding the ecological context, and tempering evolutionary interpretations would substantially strengthen the study.

      We have now thoroughly revised our manuscript in response to the editor's and reviewers' comments, in particular with regard to:

      (1) Sampling design: we clarified our methods section to indicate that we did not use worn or broken teeth in our initial analyses. We added the following sentence around line 690:

      “These tooth positions were selected from a broader examination of ~300 individual teeth from 72 specimens. We vetted the specimens and excluded 99 tooth positions (~33% of teeth initially chosen for possible inclusion) from our analyses because they either (1) were partially or completely broken at the crown, (2) were in an advanced stage of attritional wear where no cusps could be identified, or (3) possessed a combination of the two aforementioned conditions.”

      (2) Pooled versus by-tooth position analyses: we repeated the three major analyses (DTA & FEA variability through time, tooth size and variability through time, and DTA-FEA correlation through time) for individual molars (upper M1-3, lower m1-3) and select premolars (upper P3-P4 and lower p4; lower and upper p2 samples contained fewer than 5 specimens across the three time intervals, lower p3 contained only 2 specimens for the middle Paleocene, so they were excluded from the sub-partition analyses).

      For DTA & FEA variability through time (summarized as a new figure, Fig. S5, also pasted below), OPCR, DNE, and FEA trait data are supported in 78-100% of the per-tooth analyses for both the early-middle Paleocene and middle-late comparisons. By contrast, RFI and Slope data are replicated in only 22-56% of the per-tooth analyses. We qualified the main text reporting and discussion to include these sensitivity analyses so readers can assess nuances in the data when comparing pooled sample versus per-tooth analyses.

      For tooth size and variability through time (summarized in a new table, Table S3, also pasted below), we observed broad concordance in the pooled analyses and the per-tooth partitioned analyses. Different tooth positions provide strong support for different aspects of the observed trends, with the lower fourth premolar being the strongest driver of the overall trend. All of the significant trends in per-tooth analyses are in the same direction (i.e., decreasing size disparity and size mean through time) as the pooled sample. We added qualifying clarification in the text to bring attention to these refined results.

      For DTA-FEA correlation through time, we generated per-tooth correlation plots in three new figures (Figs. S9-11, only Fig. S10 shown here as an example). We observed that upper M1 patterns generally reflect the trend recovered from analysis of the overall dataset, but M2 and M3 results display inconsistent DTA-FEA correlations, possibly due to small sample sizes. Lower molar patterns generally replicate those recovered in the overall analyses, but lower M1 and M2 signals appear to be stronger than those for lower M3. Finally, low sample sizes make premolar correlations unstable, with the general pattern showing EP-MP strengthening followed by MP-LP stasis or weakening. Given these findings, it appears that the results in the pooled sample correlation plots are mainly driven by lower molar signals. It is not possible to conclude that the other tooth positions display different patterns because of the limited sample sizes.

      (3) Ecological scope of the study: although carnivorans and mesonychids are recorded from some of the time intervals examined in this study, our sampling choice of pantodonts and anagalids reflects the high abundance of available dental specimens in those clades, permitting us to make the strongest statistical inference given the incomplete fossil record. Additionally, all sampled taxa come from archaic clades that have not been determined to be specifically herbivorous; we included an additional paragraph in the introduction to explain this:

      “A major challenge with expanding analyses of post K-Pg recovery to Paleocene mammal assemblages elsewhere in the world is the generally stratigraphically limited nature of early Cenozoic sequences. In Asia, Paleocene localities in China represent the best studied to date[11]. From the earliest Paleocene, highly regional and endemic faunas are known from a handful of sedimentary basins (Fig. S1A). Among the faunal elements, only the archaic clades Anagalida and Pantodonta are consistently sampled across the major subdivisions of the Paleocene[11]. An additional complication with ecomorphological analysis of these early mammals is the uncertainty in their dietary ecology, as they are beyond the reach of conventional phylogenetic bracketing approaches to dietary reconstruction. Phenomic analysis of the placental radiation supports insectivory as the ancestral diet of the hypothetical placental ancestor, but uncertainty in the post K-Pg availability of insects and plants in some regions leave some doubt as to the accuracy of this ancestral state reconstruction[1]. Herein we treat the archaic Paleocene taxa in our analyses as having generalized diets rather than categorizing them as insectivores, herbivores, or carnivores.”

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      This work provides valuable new insights into the Paleocene Asian mammal recovery and diversification dynamics during the first ten million years post-dinosaur extinction. Studies that have examined the mammalian recovery and diversification post-dinosaur extinction have primarily focused on the North American mammal fossil record, and it's unclear if patterns documented in North America are characteristic of global patterns. This study examines dietary metrics of Paleocene Asian mammals and finds that body size disparity increased before dietary niche expansion and that dietary metrics track climatic and paleobotanical trends of Asia during the first 10 million years after the dinosaur extinction.

      Strengths:

      The Asian Paleocene mammal fossil record is greatly understudied, and this work begins to fill important gaps. In particular, the use of interdisciplinary data (i.e., climatic and paleobotanical) is really interesting in conjunction with observed dietary metric trends.

      Weaknesses:

      While this work has the potential to be exciting and contribute greatly to our understanding of mammalian evolution during the first 10 million years post-dinosaur extinction, the major weakness is in the dental topographic analysis (DTA) dataset.

      There are several specimens in Figure 1 that have broken cusps, deep wear facets, and general abrasion. Thus, any values generated from DTA are not accurate and cannot be used to support their claims. Furthermore, the authors analyze all tooth positions at once, which makes this study seem comprehensive (200 individual teeth), but it's unclear what sort of noise this introduces to the study. Typically, DTA studies will analyze a singular tooth position (e.g., Pampush et al. 2018 Biol. J. Linn. Soc.), allowing for more meaningful comparisons and an understanding of what value differences mean. Even so, the dataset consists of only 48 specimens. This means that even if all the specimens were pristinely preserved and generated DTA values could be trusted, it's still only 48 specimens (representing 4 different clades) to capture patterns across 10 million years. For example, the authors note that their results show an increase in OPCR and DNE values from the middle to the late Paleocene in pantodonts. However, if a singular tooth position is analyzed, such as the lower second molar, the middle and late Paleocene partitions are only represented by a singular specimen each. With a sample size this small, it's unlikely that the authors are capturing real trends, which makes the claims of this study highly questionable.

      With regard to sampling design: we clarified our methods section to indicate that we did not use worn or broken teeth in our initial analyses. We added the following sentence around line 690:

      “These tooth positions were selected from a broader examination of ~300 individual teeth from 72 specimens. We vetted the specimens and excluded 99 tooth positions (~33% of teeth initially chosen for possible inclusion) from our analyses because they either (1) were partially or completely broken at the crown, (2) were in an advanced stage of attritional wear where no cusps could be identified, or (3) possessed a combination of the two aforementioned conditions.”

      With regard to pooled versus by-tooth position analyses: we repeated the three major analyses (DTA & FEA variability through time, tooth size and variability through time, and DTA-FEA correlation through time) for individual molars (upper M1-3, lower m1-3) and select premolars (upper P3-P4 and lower p4; lower and upper p2 samples contained fewer than 5 specimens across the three time intervals, lower p3 contained only 2 specimens for the middle Paleocene, so they were excluded from the sub-partition analyses).

      For DTA & FEA variability through time (summarized as a new figure, Fig. S5, also pasted below), OPCR, DNE, and FEA trait data are supported in 78-100% of the per-tooth analyses for both the early-middle Paleocene and middle-late comparisons. By contrast, RFI and Slope data are replicated in only 22-56% of the per-tooth analyses. We qualified the main text reporting and discussion to include these sensitivity analyses so readers can assess nuances in the data when comparing pooled sample versus per-tooth analyses.

      For the tooth size and variability through time (summarized in a new table, Table S3, also pasted below), we observed broad concordance in the pooled analyses and the per-tooth partitioned analyses. Different tooth positions provide strong support for different aspects of the observed trends, with the lower fourth premolar being the strongest driver of the overall trend. All of the significant trends in per-tooth analyses are in the same direction (i.e., decreasing size disparity and size mean through time) as the pooled sample. We added qualifying clarification in the text to bring attention to these refined results.

      For DTA-FEA correlation through time, we generated per-tooth correlation plots in three new figures (Figs. S8-10, only Fig. S9 shown here as an example). We observed that upper M1 patterns generally reflect the trend recovered from analysis of the overall dataset, but M2 and M3 results display inconsistent DTA-FEA correlations, possibly due to small sample sizes. Lower molar patterns generally replicate those recovered in the overall analyses, but lower M1 and M2 signals appear to be stronger than those for lower M3. Finally, low sample sizes make premolar correlations unstable, with the general pattern showing EP-MP strengthening followed by MP-LP stasis or weakening. Given these findings, it appears that the results in the pooled sample correlation plots are mainly driven by lower molar signals. It is not possible to conclude that the other tooth positions display different patterns because of the limited sample sizes.

      Reviewer #2 (Public review):

      Summary:

      This study uses dental traits of a large sample of Chinese mammals to track evolutionary patterns through the Paleocene. It presents and argues for a 'brawn before bite' hypothesis - mammals increased in body size disparity before evolving more specialized or adapted dentitions. The study makes use of an impressive array of analyses, including dental topographic, finite element, and integration analyses, which help to provide a unique insight into mammalian evolutionary patterns.

      Strengths:

      This paper helps to fill in a major gap in our knowledge of Paleocene mammal patterns in Asia, which is especially important because of the diversification of placentals at that time. The total sample of teeth is impressive and required considerable effort for scanning and analyzing. And there is a wealth of results for DTA, FEA, and integration analyses. Further, some of the results are especially interesting, such as the novel 'brawn before bite' hypothesis and the possible link between shifts in dental traits and arid environments in the Late Paleocene. Overall, I enjoyed reading the paper, and I think the results will be of interest to a broad audience.

      Weaknesses:

      I have four major concerns with the study, especially related to the sampling of teeth and taxa, that I discuss in more detail below. Due to these issues, I believe that the study is incomplete in its support of the 'brawn before bite' hypothesis. Although my concerns are significant, many of them can be addressed with some simple updates/revisions to analyses or text, and I try to provide constructive advice throughout my review.

      (1) If I understand correctly, teeth of different tooth positions (e.g., premolars and molars), and those from the same specimen, are lumped into the same analyses. And unless I missed it, no justification is given for these methodological choices (besides testing for differences in proportions of tooth positions per time bin; L902). I think this creates some major statistical concerns. For example, DTA values for premolars and molars aren't directly comparable (I don't think?) because they have different functions (e.g., greater grinding function for molars). My recommendation is to perform different disparity-through-time analyses for each tooth position, assuming the sample sizes are big enough per time bin. Or, if the authors maintain their current methods/results, they should provide justification in the main text for that choice.

      With regard to pooled versus by-tooth position analyses: we repeated the three major analyses (DTA & FEA variability through time, tooth size and variability through time, and DTA-FEA correlation through time) for individual molars (upper M1-3, lower m1-3) and select premolars (upper P3-P4 and lower p4; lower and upper p2 samples contained fewer than 5 specimens across the three time intervals, lower p3 contained only 2 specimens for the middle Paleocene, so they were excluded from the sub-partition analyses).

      For DTA & FEA variability through time (summarized as a new figure, Fig. S5, also pasted below), OPCR, DNE, and FEA trait data are supported in 78-100% of the per-tooth analyses for both the early-middle Paleocene and middle-late comparisons. By contrast, RFI and Slope data are replicated in only 22-56% of the per-tooth analyses. We qualified the main text reporting and discussion to include these sensitivity analyses so readers can assess nuances in the data when comparing pooled sample versus per-tooth analyses.

      For the tooth size and variability through time (summarized in a new table, Table S3, also pasted below), we observed broad concordance in the pooled analyses and the per-tooth partitioned analyses. Different tooth positions provide strong support for different aspects of the observed trends, with the lower fourth premolar being the strongest driver of the overall trend. All of the significant trends in per-tooth analyses are in the same direction (i.e., decreasing size disparity and size mean through time) as the pooled sample. We added qualifying clarification in the text to bring attention to these refined results.

      For DTA-FEA correlation through time, we generated per-tooth correlation plots in three new figures (Figs. S8-10, only Fig. S9 shown here as an example). We observed that upper M1 patterns generally reflect the trend recovered from analysis of the overall dataset, but M2 and M3 results display inconsistent DTA-FEA correlations, possibly due to small sample sizes. Lower molar patterns generally replicate those recovered in the overall analyses, but lower M1 and M2 signals appear to be stronger than those for lower M3. Finally, low sample sizes make premolar correlations unstable, with the general pattern showing EP-MP strengthening followed by MP-LP stasis or weakening. Given these findings, it appears that the results in the pooled sample correlation plots are mainly driven by lower molar signals. It is not possible to conclude that the other tooth positions display different patterns because of the limited sample sizes.

      Also, I think lumping teeth from the same specimen into your analyses creates a major statistical concern because the observations aren't independent. In other words, the teeth of the same individual should have relatively similar DTA values, which can greatly bias your results. This is essentially the same issue as phylogenetic non-independence, but taken to a much greater extreme.

      It seems like it'd be much more appropriate to perform specimen-level analyses (e.g., Wilson 2013) or species-level analyses (e.g., Grossnickle & Newham 2016) and report those results in the main text. If the authors believe that their methods are justified, then they should explain this in the text.

      Based on the per-tooth partition analyses we performed and reported above, the results now show that the overall trends described in the previous draft of the study are a composite of signals from different regions of the dentition. For example, the OPCR, DNE, and FEA trends persist across most tooth positions, whereas the Slope and RFI trends are mainly driven by lower fourth premolar patterns. The tooth size results are also mainly driven by lower fourth premolar patterns, but tooth disparity trends are broadly supported across tooth positions. These observations indicate that the overall trends remain valid, but there are nuances as to which tooth positions are driving which components of the trends. As such, we deem the overall results to be valid, and focused our revision on providing the nuances so readers can assess through-time patterns in more detail than in the previous version of the study.

      (2) Maybe I misunderstood, but it sounds like the sampling is almost exclusively clades that are primarily herbivorous/omnivorous (Pantodonta, Arctostylopida, Anagalida, and maybe Tillodonta), which means that the full ecomorphological diversity of the time bins is not being sampled (e.g., insectivores aren't fully sampled). Similarly, the authors say that they "focused sampling" on those major clades and "Additional data were collected on other clades ... opportunistically" (L628). If they favored sampling of specific clades, then doesn't that also bias their results?

      If the study is primarily focused on a few herbivorous clades, then the Introduction should be reframed to reflect this. You could explain that you're specifically tracking herbivore patterns after the K-Pg.

      We appreciate the reviewer’s suggestion that our sampling may have focused on putative herbivorous clades more than others. However, at the early stage of placental evolution during the Paleocene, and in particular among the endemic forms we studied from south China, it is unclear to us that such clear-cut ecomorphological categories were present amongst the fossil mammals. Thus, we take a more agnostic approach and do not define the dietary categories of the sample taxa (and by extension, those of the unsampled taxa). Although we recognize that representatives of certain clades, such as Carnivora, may be more reasonably interpreted as carnivores/insectivores/omnivores and, in the current context, remain unsampled, we point out that including tooth samples from rare taxa such as carnivores likely would have biased the analyses temporally. Chinese Paleocene carnivores are known only from one of the three time intervals analyzed (representing only a handful of specimens), and so would potentially inflate the disparity in that time interval relative to the others (if dentitions specialized for carnivory are assumed to be present in the Paleocene). To clarify this point, we added a paragraph in the introduction:

      “A major challenge with expanding analyses of post K-Pg recovery to Paleocene mammal assemblages elsewhere in the world is the generally stratigraphically limited nature of early Cenozoic sequences. In Asia, Paleocene localities in China represent the best studied to date[11]. From the earliest Paleocene, highly regional and endemic faunas are known from a handful of sedimentary basins (Fig. S1A). Among the faunal elements, only the archaic clades Anagalida and Pantodonta are consistently sampled across the major subdivisions of the Paleocene[11]. An additional complication with ecomorphological analysis of these early mammals is the uncertainty in their dietary ecology, as they are beyond the reach of conventional phylogenetic bracketing approaches to dietary reconstruction. Phenomic analysis of the placental radiation supports insectivory as the ancestral diet of the hypothetical placental ancestor, but uncertainty in the post K-Pg availability of insects and plants in some regions leave some doubt as to the accuracy of this ancestral state reconstruction[1]. Herein we treat the archaic Paleocene taxa in our analyses as having generalized diets rather than categorizing them as insectivores, herbivores, or carnivores.”

      (3) There are a lot of topics lacking background information, which makes the paper challenging to read for non-experts. Maybe the authors are hindered by a short word limit. But if they can expand their main text, then I strongly recommend the following:

      a) The authors should discuss diets. Much of the data are diet correlates (DTA values), but diets are almost never mentioned, except in the Methods. For example, the authors say: "An overall shift towards increased dental topographic trait magnitudes ..." (L137). Does that mean there was a shift toward increased herbivory? If so, why not mention the dietary shift? And if most of the sampled taxa are herbivores (see above comment), then shouldn't herbivory be a focal point of the paper?

      We edited the introduction to say that “We used dental topographical traits as indicators of ecomorphological diversity[28] and examined temporal shifts in tooth crown complexity, curvature, and height and their association with tooth performance in terms of deformation resistance using topographic and simulation analyses.” We also added the following to the methods section to clarify that we are using DTA as a general ecomorphological proxy, and not as a direct dietary proxy:

      “Overall, we use these DTA traits as indicators of ecomorphological capacity, but do not link them explicitly to dietary categories. The craniodental morphology of archaic placental clades in general have not been demonstrated to share the same structure-function linkages as crown mammals, so the aforementioned linkages between DTA and dietary ecology in extant species only serve as evidence that DTA is a potentially useful ecomorphological proxy, without the application of those DTA-diet relationships to the Paleocene fossil mammal dataset.”

      b) The authors should expand on "we used dentitions as ecological indicators" (L75). For non-experts, how/why are dentitions linked to ecology? And, again, why not mention diet? A strong link between tooth shape and diet is a critical assumption here (and one I'm sure that all mammalogists agree with), but the authors don't provide justification (at least in the Introduction) for that assumption. Many relevant papers cited later in the Methods could be cited in the Introduction (e.g., Evans et al. 2007).

      We added the following sentence to clarify our usage of tooth crowns as ecomorphological proxies: “Teeth are among the most well-preserved parts of fossil mammals, and the fact that they interface directly with the environment through mastication makes them suitable elements for studying potential ecology-morphology linkages.”

      c) Include a better introduction of the sample, such as explicitly stating that your sample only includes placentals (assuming that's the case) and is focused on three major clades. Are non-placentals like multituberculates or stem placentals/eutherians found at Chinese Paleocene fossil localities and not sampled in the study, or are they absent in the sampled area?

      We modified the following sentence to indicate our sampling focus on placentals: “Our analyses focused on placental mammals from three of the most fossiliferous and biogeographically isolated Paleocene sedimentary sequences in paleotropical Asia: The Nanxiong, Qianshan, and Chijiang Basins in present-day south China[23–27] (Fig. S1).”

      d) The way in which "integration" is being used should be defined. That is a loaded term which has been defined in different ways. I also recommend providing more explanation on the integration analyses and what the results mean.

      If the authors don't have space to expand the main text, then they should at least expand on the topics in the supplement, with appropriate citations to the supplement in the main text.

      We replaced all mentions of “integration” with “covariation” to avoid using the loaded terminology. Covariation more accurately reflects the correlation between two sets of traits (DTA vs FEA) without invoking developmental mechanisms implied by modularity/integration.

      (4) Finally, I'm not convinced that the results fully support the 'brawn before bite' hypothesis. I like the hypothesis. However, the 'brawn before ...' part of the hypothesis assumes that body size disparity (L63) increased first, and I don't think that pattern is ever shown. First, body size disparity is never reported or plotted (at least that I could find) - the authors just show the violin plots of the body sizes (Figures 1B, S6A). Second, the authors don't show evidence of an actual increase in body size disparity. Instead, they seem to assume that there was a rapid diversification in the earliest Paleocene, and thus the early Paleocene bin has already "reached maximum saturation" (L148). But what if the body size disparity in the latest Cretaceous was the same as that in the Paleocene? (Although that's unlikely, note that papers like Clauset & Redner 2009 and Grossnickle & Newham 2016 found evidence of greater body size disparity in the latest Cretaceous than is commonly recognized.) Similarly, what if body size disparity increased rapidly in the Eocene? Wouldn't that suggest a 'BITE before brawn' hypothesis? So, without showing when an increase in body size diversity occurred, I don't think that the authors can make a strong argument for 'brawn before [insert any trait]".

      Although it's probably well beyond the scope of the study to add Cretaceous or Eocene data, the authors could at least review literature on body size patterns during those times to provide greater evidence for an earliest Paleocene increase in size disparity.

      We added a sentence in the discussion of body size during the Paleocene to note that the largest late Cretaceous fossil mammals in China are shrew- to gopher-sized, whereas the largest early Paleocene Chinese Endemic Pantodonts are dog-sized:

      “Dog-sized CEPs such as Bemalambda reached sizes not seen in late Cretaceous mammals from China such as Zhangolestes and Kryptobaatar, which are shrew- to gopher-sized [Meng 2014]”

      Reference: Meng, J. (2014). Mesozoic mammals of China: implications for phylogeny and early evolution of mammals. Natl. Sci. Rev. 1, 521–542. 10.1093/nsr/nwu070.

      Furthermore, we tempered our discussion to restrict the “brawn before bite” hypothesis to post K-Pg recovery in the Paleocene. Body size patterns shifted in the Eocene as crown clades replaced the archaic endemic clades analyzed in our study, and much larger taxa began to appear after the PETM. Such body size shifts involve different clades, and likely different dynamics, from those of the 10-million-year interval examined in our study, so we refrain from commenting on post-Paleocene times.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      (1) In regard to the DTA dataset: Was there a method used to 'fix' these teeth before dental topographic analyses were implemented? If so, this should be explicitly stated. If not, the authors should explain why broken, worn, or abraded teeth were used.

      We excluded the incomplete teeth from our analyses. We added the following sentence for clarification: “These tooth positions were selected from a broader examination of ~300 individual teeth from 72 specimens. We vetted the specimens and excluded 99 tooth positions (~33% of teeth initially chosen for possible inclusion) from our analyses because they either (1) were partially or completely broken at the crown, (2) were in an advanced stage of attritional wear where no cusps could be identified, or (3) possessed a combination of the two aforementioned conditions.”

      (2) The authors should explicitly explain why all tooth positions were analyzed together. Again, this is not something that is typically done, and some explanation would be helpful for readers.

      We added a paragraph in the methods section to explain both our pooled sampling approach, as well as the per-tooth analyses added in this revised manuscript:

      “Given the rarity of Paleocene fossil material from China, we combined data from different tooth positions into three pooled samples, one for each of the time intervals examined (early, middle, late Paleocene). We treated the pooled samples as representative of the range of dental topographic features and bite performance traits available to the mammal taxa under study. In this way, the variance estimates are interpreted as measures of the morphological and performance heterogeneity present in each time interval dataset. To further tease out the possibility of specific tooth positions driving the overall trends observed in the pooled samples, we also performed the DTA, FEA, DTA-FEA correlation, and tooth size through-time analyses using per-tooth data partitions.”

      (3) I think the authors should hedge their claims a bit more and recognize the limitations of their study (e.g., sample size and tooth preservation).

      We thank the reviewer for raising this important point. We carefully read through the main text and further tempered our interpretations based on the limitations of our data. Additionally, we added a paragraph in the supplemental text to summarize the major sources of uncertainty in the sample:

      “Sample and methodological limitations

      The highly fragmentary nature of early Cenozoic mammal fossils in Asia means that even the best preserved faunas studied herein contain much missing information. First, the absence of a high-resolution chronological framework prevents the fossil data from being analyzed on a continuous time axis; the binning of the samples into three main intervals within a 10-million-year period hinders additional hypotheses about the environmental and climatic correlations of the dental structure-performance results presented. Second, the uneven sampling of the available mammalian assemblage throughout the Paleocene sites in China limits the breadth of ecomorphological categories included in the analyses; rarer taxa representing more specialized carnivore, insectivore, or herbivore forms were not included in our sampling. Third, the spatial discontinuity of stratigraphically younger (Eocene) and older (Cretaceous) mammal assemblages means that body size and ecomorphological shifts bracketing the Paleocene cannot currently be analyzed alongside the dataset presented. These limitations should be taken into account when considering the interpretations made in the main text.”

      Reviewer #2 (Recommendations for the authors):

      I'm including my Line Comments here as recommendations for the authors. But note that many of my recommendations are also in my Public Review.

      L22: "3% of sites"? Do you mean 3% of global sites?

      Yes, we revised the sentence to indicate 3% of global sites. Thank you for this suggestion.

      L35: This is nitpicky because it's not crucial to your study, but I can't help but point out that the Long Fuse, etc, hypotheses are specifically about the DIVERGENCE TIMES for Placentalia and major subclades, NOT the 'adaptive radiation' of placentals like you imply in your text. Adaptive radiations include ecomorphological diversification and are driven by ecological opportunity (e.g., Schluter 2000). (Emphasis on 'ecological.') The long fuse, short fuse, and explosive models do not include an ecological component - i.e., the diversifications could have occurred without ecological diversification. Instead, for hypotheses that are specifically on the adaptive/ecological radiation of mammals, see the Early Rise, Suppression (or Dinosaur Incumbency; Benevento et al. 2023 Palaeontology), and Late Rise hypotheses (Grossnickle et al. 2019 TREE). These hypotheses apply broadly to all mammals, not just placentals (see Box 1's figure in Grossnickle et al. 2019), but they can still be applied to mammalian subclades like eutherians/placentals (e.g., see Thomas Halliday papers).

      Thank you for helping to clarify the adaptive radiation vs. divergence time concepts. We edited this sentence to mention the adaptive radiation hypotheses instead, adding in the references provided by the reviewer.

      L39-40: I think your comment is probably accurate. But keep in mind that advocates of the Early Rise and Delayed Rise hypotheses (see citations within Grossnickle et al. 2019) might argue that other time periods, other than the Paleocene, are equally or more important.

      We added a reference to Grossnickle et al. 2019 to bring attention to potential arguments otherwise. Thank you for the suggestion.

      L48: I think the inclusion of "at higher latitudes" is a little distracting or misleading and should be erased. It implies that the taxonomic diversification was ONLY rapid at higher latitudes. But many of the references that you cite include analyses at the global or continental scale (e.g., Alroy 1999, Grossnickle & Newham 2016) and don't distinguish patterns at different latitudes. If you want to keep the point about latitudes, then I recommend inserting a separate sentence on that point.

      We removed “at higher latitudes”.

      L50: Isn't "stem lineages and those with no living relatives" somewhat redundant? Or do you mean something like "stem placental/eutherian lineages and extinct placental subgroups"?

      Yes, we adopted the suggested phrasing. Thank you.

      L53: I recommend starting a new paragraph around here (maybe starting with "Distinct from ...") that focuses specifically on introducing the 'brawn before [ecomorphological trait]' hypothesis.

      Done.

      L56: "large herbivores and their predators"? Are you just referring to mammals? Wilson (2013), which you cite, and Grossnickle & Newham (2016) argued that dietary specialists were targeted at the K-Pg, but none of the herbivores were "large" (at least relative to Cenozoic herbivores). And most faunivorous mammals at the time were probably insectivorous and not preying on herbivorous mammals, besides maybe a few outlying taxa (e.g., Altacreodus, Nanocuris). I'd revise your sentence for clarity.

      We removed “disproportionately impacting large herbivores and their predators” for clarity.

      L63: I'd replace "ecometric" with "ecomorphological". Ecometrics commonly refers to using fossil traits to infer paleo environments/climate (e.g., see papers by David Polly, Michelle Lawing, etc), which I don't think is what you're referring to here. (E.g., I don't think that brain size or jaw shape patterns were/are used to infer paleo environments.)

      Revised. Thank you.

      L85: I strongly advise against making conclusions like this: "Dental height and sharpness variability ... [spiked] in the middle Paleocene corresponding to a short-lived negative excursion in global temperature." That implies that the change in dentitions is linked to global temperature changes, which I don't think your results support. Later in the text you highlight the temporal uncertainty of your time bin ages (L650) and say that the middle Paleocene bin could be as old as ~62 Ma (L646), which is well before the negative excursion (and looks to be more in line with a positive excursion!), at least according to the Figure 1 time scale (see comment below). So, I don't think that your results even support your statement.

      We reworded this sentence to say “Dental height and sharpness variability were low in the beginning and end of the time interval, with a peak in the middle Paleocene. This pattern is observed both when dentitions are considered holistically and by tooth position in the lower dentition (Fig. S5; upper teeth display the opposite pattern).”

      L144: Using variance for disparity seems fine. But keep in mind that other disparity metrics, such as range (or sum-of-ranges for multivariate data), might produce different results. For instance, variance of RFI and Slope spike in the middle Paleocene, like you point out, but based on the values in Figure 1A, it looks like the ranges stay relatively constant through the Paleocene (although I realize that the ranges might change with bootstrapping). So, your choice of disparity metric might have a big influence on your conclusions. Alternatively, you could calculate disparity using multiple metrics (e.g., Brusatte et al. 2012 Nature Communications; Grossnickle & Newham 2016 supplemental analyses), even if it's just for supplemental analyses.

      Thank you for bringing the choice of disparity measures to our attention. We conducted a parallel set of bootstrapped disparity calculation and comparison analyses using range lengths (maximum trait value – minimum trait value for a given trait) and summarized the through-time trends as for variance-based results (Fig. S5). Overall, very similar trends are observed, providing support for the variance-based data interpretation presented in the main text. We added explanation of this additional sensitivity testing both in the main text and in the supplemental text.
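
      As an illustration only, the parallel variance-based and range-based disparity calculation described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the code used in the study: the function name `bootstrap_disparity`, the number of bootstrap replicates, and the trait values are all hypothetical.

```python
import random
import statistics


def bootstrap_disparity(values, n_boot=1000, seed=0):
    """Bootstrap one trait in one time bin.

    Returns the mean bootstrapped variance and the mean bootstrapped
    range length (maximum trait value minus minimum trait value).
    """
    rng = random.Random(seed)
    variances, ranges = [], []
    for _ in range(n_boot):
        # Resample with replacement, keeping the original sample size.
        resample = rng.choices(values, k=len(values))
        variances.append(statistics.variance(resample))
        ranges.append(max(resample) - min(resample))
    return statistics.mean(variances), statistics.mean(ranges)


# Hypothetical trait values (NOT data from the study) for three time bins.
bins = {
    "early": [0.41, 0.45, 0.43, 0.47, 0.44],
    "middle": [0.30, 0.52, 0.38, 0.61, 0.45],
    "late": [0.42, 0.46, 0.44, 0.48, 0.43],
}

results = {name: bootstrap_disparity(vals) for name, vals in bins.items()}
for name, (var, range_len) in results.items():
    print(f"{name}: variance={var:.4f}, range={range_len:.4f}")
```

      If the two metrics rank the time bins in the same order, as they do for these made-up values, the variance-based interpretation is less likely to be an artifact of the metric chosen.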

      L147: "body size disparity ... (Fig. 1B, S6A, Table 1, Data S5)." But I don't see disparity calculated or plotted in any of the figures/tables that you cite. You test for differences in disparity between time bins (Table 1), but that doesn't provide the actual disparity patterns.

      We generated a new figure (Fig. S8) to show the tooth size variance and range levels across time and data partitions, and modified this sentence to say that “Over the same time interval examined, body size disparity and mean were higher in the early Paleocene than in subsequent time intervals (Fig. S8, Table S3; also supported by premolar 4 and upper molar partition analyses), indicating that substantial increases in the disparity of dental complexity, curvature, and height lagged behind maximum tooth size disparity during the Paleocene.”

      L151-153: Maybe. But you're basing this on a much narrower temporal range (Paleocene) than the brain and jaw studies, and I think those studies observed big increases in brain/jaw disparity in the Eocene, which you don't sample. And as I explained elsewhere, I'm not convinced that your results strongly support the same pattern. At a minimum, I recommend tempering your conclusions to better reflect the uncertainty of your results.

      We tempered our statements here to say that “This suggests a ‘brawn before bite’ pattern in endemic Asian mammals, partially mirroring the endocranial and jaw functional morphology patterns identified in their North American and European counterparts [21,22]. These findings raise the possibility that an initial size-driven post-K-Pg recovery followed by ecomorphological radiation was a global phenomenon, even as regional tectonic events such as the initial collision of the Indian subcontinent with Asia and Deccan Traps volcanism influenced local mammal evolution.”

      L170: I'm not well-versed in integration (and modularity) studies, so maybe this reflects my ignorance, but I had trouble understanding sentences like this: "These findings indicate that form-function malleability, the coexistence of distinct topography-performance relationships in each time and taxon partition while overall integration between the two trait groups increases between time bins, was present throughout the Paleocene." If there is space, I recommend revising and/or breaking apart long, jargon-y sentences like that (throughout the paper) so that they're more digestible for readers.

      We simplified complex sentences such as the one the reviewer noted, in order to communicate our findings and interpretations more clearly. Thank you for the suggestion.

      L183: It's probably fine to assume most placental orders arose in the Paleocene based on fossil evidence. But keep in mind that molecular studies often argue that many orders arose in the Late Cretaceous.

      We revised the statement to indicate a “Cretaceous/Paleocene” origin of many modern mammal orders.

      L200-207: Again, this might just reflect my ignorance concerning integration analyses, but I recommend expanding on this text to better explain how your integration results support this conclusion. It seems really interesting, and I like the Garden of Eden hypothesis. It's just not immediately clear to me how your results support that hypothesis. A little more background on how to interpret the integration results would be helpful.

      We expanded the discussion here to say that “Such flexibility in dental form-function linkage permits ‘mix and match’ trait combinations rather than evolutionary change as a single unit, potentially enhancing the evolvability of feeding ecological traits as new environmental conditions arose [Goswami et al. 2015]”

      Reference: Goswami, A., Binder, W.J., Meachen, J., and O’Keefe, F.R. (2015). The fossil record of phenotypic integration and modularity: A deep-time perspective on developmental and evolutionary dynamics. Proc. Natl. Acad. Sci. 112, 4891–4896. 10.1073/pnas.1403667112.

      L218: "reached maximum tooth size disparity early". Again, I don't see size disparity plotted or reported. And without baseline comparisons (Late K or Eocene), it's hard to interpret your results and evaluate what 'maximum' means (Figure 1B).

      We revised the sentence to now say “In response, Paleocene mammal clades in south China between dental topography and bite performance later, all the while maintaining high levels of variability in dental complexity and convexity (Fig. 1).”

      Figure 1A: The time scale in the top left of the figure looks off. Shouldn't the K-Pg be at 66 Ma (not 65 Ma) and the P-E boundary at 56 Ma (not ~54 or 55)?

      We revised Fig. 1 to fix the time scale so that K-Pg is at 65.5 Ma and the P-E boundary at 56 Ma. Thank you for catching this.

      Figure 1A: Is there a different y-axis scale for the variance (red line) results?

      Yes, the y axes for the variance curves were missing. We added them back in. Thank you.

      L628-629: As I explained above, it feels like you focused your sampling just on herbivorous/omnivorous groups, and, if true, this is an important point that should be discussed at the forefront of the paper. Does your sample truly represent the total ecological diversity of the mammalian faunas at the time?

      We agree with the reviewer about the potential partial sampling of the range of ecomorphological diversity when only the most abundant clades are included in the analyses. However, we refrain from interpreting the dietary groupings represented in the dataset using an assumption of functional morphology from crown/extant clades. We added a paragraph in the introduction to bring attention to the inherent uncertainty in the ecological diversity of the dataset:

      “A major challenge with expanding analyses of post K-Pg recovery to Paleocene mammal assemblages elsewhere in the world is the stratigraphically limited nature of early Cenozoic sequences that produce fossil mammals. In Asia, Paleocene localities in China represent the best studied to date[11]. From the earliest Paleocene, highly regional and endemic faunas are known from a handful of sedimentary basins (Fig. S1A). Among the faunal elements, only the archaic placental clades Anagalida and Pantodonta are consistently sampled across the major subdivisions of the Paleocene[11]. An additional complication with ecomorphological analysis of these early mammals is the uncertainty in their dietary ecology, as they are beyond the reach of conventional phylogenetic bracketing approaches to dietary reconstruction. Phenomic analysis of the placental radiation supports insectivory as the ancestral diet of the hypothetical placental ancestor, but uncertainty in the post K-Pg availability of insects and plants in some regions leave some doubt as to the accuracy of this ancestral state reconstruction[1]. Herein we treat the archaic Paleocene taxa in our analyses as having uncharacterized diets rather than categorizing them as insectivores, herbivores, or carnivores.”

      L653: Sorry if this is mentioned elsewhere, but did you avoid using teeth with especially worn or broken cusps? You might expand on how you chose teeth for your sample.

      We left out this detail in the original submission. Thank you for pointing this out. We had to exclude a third of the teeth because they were too worn or broken. We added the following explanation to the methods section:

      “These tooth positions were selected from a broader examination of ~300 individual teeth from 72 specimens. We vetted the specimens and excluded 99 tooth positions (~33% of teeth initially chosen for possible inclusion) from our analyses because they either (1) were partially or completely broken at the crown, (2) were in an advanced stage of attritional wear where no cusps could be identified, or (3) possessed a combination of the two aforementioned conditions.”

      L654: "specimens" should be "teeth", correct? In the preceding sentence, you say that there are 200 teeth from only 48 specimens.

      Corrected.

    1. eLife Assessment

      This important study links allelic expression imbalance with replication timing, suggesting a stochastic model for haploinsufficiency in dosage-sensitive disease. The integration of allele-specific RNA-seq and replication timing in clonal systems provides solid evidence for an association between asynchronous replication and allelic imbalance, although the scope and generality should be addressed in future work. This study will interest epigeneticists and genome regulation researchers studying replication timing and monoallelic expression, as well as developmental biologists and human geneticists concerned with clonal heterogeneity, haploinsufficiency, and variable disease penetrance.

      [Editors' note: this paper was reviewed by Review Commons.]

    2. Reviewer #2 (Public review):

      Summary:

      The authors pair analysis of replication timing and allele-specific expression in clonal populations of primary human cells. They combine these data with previously published data on clones from transformed human cell lines. They identify a number of genomic regions that display asynchronous replication timing in at least one clone and correlate these regions with allele-specific expression of genes within them. They also observe that several interesting gene sets, including genes that are associated with human diseases, map to asynchronously replicating regions. This is a good experimental approach that builds on already published data demonstrating the connection between allelic imbalance and replication timing.

      - This is a research topic that touches on a few sub-fields of biology, and thus to make the paper more approachable we would recommend a careful edit of the text for clarity and precision of language.

      - The authors point out that this is a decades-old field; we would suggest using terminology established within the field where possible. Allelic imbalance has been referred to as AI, MAE (monoallelic expression), RMAE (random monoallelic expression), etc. The paper whose mouse data the authors make use of uses Asynchronous Stochastic Replication Timing (ASRT) instead of VERT to refer to the same phenomenon.

      - The Methods do not provide sufficient detail to fully evaluate or reproduce these experiments.

      - It is helpful to show representative loci as the authors do in Fig 1F and G and Fig 2 but these panels are very densely rendered and thus difficult to process visually - even the cartoon version (1D) is thick with overlapping lines. The point that allelic imbalance is enriched in VERTs would be enhanced if the authors could present the allelic ratio for all genes found in all VERTs, demonstrating how replication timing on either chromosome affects the allelic ratio.

      - The authors make the important point that VERTs are unlikely to be shared among different cell types and tissues (Fig 1i), but then find an enrichment for neuronal and immune genes in VERT regions identified in ACPs. It follows that these same genes are unlikely to be in such regions in the tissues where they are relevant. Some of the GO terms presented are too broad to suggest any biological significance to the result, even if there is statistical significance (for example, the top term for LCL clones 'Cytoplasm' is associated with 12,000 genes, and the second term for mouse clones 'Membrane' is associated with 10,000). It would be helpful to focus on GO terms lower in the GO hierarchy.

      - Figure 3 highlights the association of related gene clusters with VERTs but the VERTs are assigned based on variable replication timing in just 1 or 2 clones. This is an interesting observation, but to make the point that "VERT regions frequently coincide with gene clusters in the human genome" there needs to be a systematic assessment of replication timing at all gene clusters across all clones, and a statistical test for significance.

      - It is an interesting hypothesis that VERTs are conserved between species at syntenic loci. If such regions are really conserved, one would expect that replication timing at these sites would be consistently asynchronous. However the data presented shows that in human clones these VERTs can be specific to an individual donor (as in 5A) or an individual clone (as in 5H).

      - The finding that VERTs coincide with neurodevelopmental disease genes in immune and cartilage cells is at odds with the previous statements and data about the tissue specificity of VERTs. In order to support the claim that neurodevelopmental disease associated genes reside in asynchronously replicating regions, and are thus more prone to allelic imbalance, it would be helpful if the authors demonstrated this phenomenon in neuronal cells.

      - The authors consistently lean on sparse samples (i.e. a single clone) within a modestly sized dataset (4 clones from 2 donors each) to propose a new model for haploinsufficiency in human disease. It may well be but the consistent focus on limited elements in the data and perhaps an overreach in the interpretation makes it difficult to appreciate the very good experiments presented.

      - This section refers to the revised version of the paper.

      We would like to thank the authors for the changes and explanations offered. Although we don't fully agree with a few answers offered, overall the answers and changes in the manuscript have significantly improved the work presented. As such it should be of interest to many readers.

    3. Author response:

      The following is the authors’ response to the original reviews

      General Statements

      We thank the reviewers for their thoughtful and constructive comments, which substantially improved our manuscript. In response, we have revised the text and figures throughout to address the points raised. Specifically, we have:

      i. Refined our definition of Inactivation/Stability Centers (I/SCs): We limit this designation to loci where both Allelic Expression Imbalance (AEI) and Variable Epigenetic Replication Timing (VERT) were detected, either in the present study or in previously published work.

      ii. Expanded methodological clarity: We provide detailed descriptions of how VERT regions were identified, annotated, and quantified, including thresholds for allelic imbalance, replication timing variability, and sampling depth. We also justify the ≥80% AEI cutoff, which is based on recently published studies showing that modest allelic biases can have biological and clinical significance.

      iii. Enhanced benchmarking and validation: In addition to the analysis of X inactivation in female ACP cells, we now include comparisons between imprinted and non-imprinted regions to benchmark the magnitude of allelic replication timing imbalance, demonstrating that the magnitude of imbalance observed at non-imprinted VERT regions is comparable to known imprinted regions.

      iv. Addressed tissue specificity and sampling limitations: We now discuss how the data derived from a limited number of clones, tissues, and individuals support the identification of robust AEI and VERT patterns. In the future, additional tissues and individuals will be required to capture the full diversity of I/SC regulation.

      v. Clarified biological relevance: We have expanded our discussion to highlight the consistency of AEI findings across cell types, including examples of genes implicated in neurodevelopmental and neurodegenerative disorders, and we clarify our model of how I/SC regulation contributes to haploinsufficiency, variable expressivity, and incomplete penetrance in human disease.

      vi. Improved figures and supplemental data: We have updated figure legends for clarity, added a new supplementary figure benchmarking imprinted regions, and added supplementary tables containing the full description of our GO analysis, the list of I/SCs where we have detected both VERT and AEI, and the ratios of the number of transcripts derived from early- and late-replicating alleles for the I/SCs illustrated in all figures. We have also cross-referenced all supplementary tables.

      Point-by-point description of the revisions

      Reviewer 1:

      The existence of VERT regions is well supported, but the number of regions called as ISCs may be inflated by permissive thresholds (e.g., AEI ≥ 0.8 or ≤ 0.2 in a single clone). This risks conflating transient stochastic differences with stable ISCs.

      We selected the >80% (or <20%) allelic imbalance threshold, along with the requirement of at least one biallelic clone, as our criterion for significant AEI. This choice was guided by a recent study demonstrating that allelic imbalance as low as 65%/35% is enough to affect disease penetrance in humans (Nature 2025; 637:1186–1197). For completeness, results obtained using more stringent thresholds (>90% and >95% imbalance) are presented in Supplementary Table 2.

      Furthermore, it is unlikely that transient stochastic differences in allelic expression, such as those detected by single-cell RNA sequencing assays (Nat. Rev. Genet. 2015; 16:653–664), would be captured by our approach. Each clone in our study was expanded from a single cell to over one million cells before both RNA-seq and Repli-seq analyses, effectively averaging out transient transcriptional and/or replication fluctuations, and thus reflecting stable, mitotically heritable epigenetic states.

      Reviewer 1:

      More robust approaches would include using magnitude of imbalance, annotating VERTs by genomic location, applying stricter thresholds for replication timing, and benchmarking AEI distributions against the X chromosome.

      All VERT regions identified in this study were annotated according to both the magnitude of allelic imbalance and their genomic coordinates, using 250 kb windows for the human samples and 50 kb windows for the mouse samples (see Supplementary Tables 1 and 6). Figure 1c directly compares the magnitude of imbalance, defined as outliers in the standard deviation, for both allelic replication timing and allelic expression across autosomal and X-linked loci in female ACP cells.
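
      As an illustration only, and not the authors' pipeline, the idea of scoring fixed genomic windows by the standard deviation of replication timing and flagging outliers can be sketched as below. The input layout (one list of per-allele, per-clone timing values per window), the function names, and the z-score cutoff of 2.5 are assumptions made for this sketch; the response does not specify how outliers were defined numerically.

```python
from statistics import mean, pstdev

WINDOW_BP = 250_000  # window size quoted for the human samples (50 kb for mouse)


def window_index(position_bp, window_bp=WINDOW_BP):
    """Map a genomic coordinate to the index of its fixed-size window."""
    return position_bp // window_bp


def flag_variable_windows(rt_by_window, z_cutoff=2.5):
    """Return windows whose replication-timing SD is a genome-wide outlier.

    rt_by_window maps a window index to the replication-timing values
    measured for every allele in every clone (hypothetical layout).
    z_cutoff is an assumed threshold, not a value taken from the study.
    """
    # Spread of replication timing across alleles and clones, per window.
    sd_by_window = {w: pstdev(values) for w, values in rt_by_window.items()}
    sds = list(sd_by_window.values())
    centre, spread = mean(sds), pstdev(sds)
    if spread == 0:
        return []  # every window is equally variable; nothing stands out
    # A window is flagged when its SD sits far above the typical SD.
    return sorted(
        w for w, sd in sd_by_window.items() if (sd - centre) / spread > z_cutoff
    )
```

      With this layout, a window in which alleles switch between early and late timing across clones produces a large standard deviation, whereas a window in which all alleles replicate at a similar time does not.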

      In addition, we detected allelic replication asynchrony at 12 known imprinted loci, and the standard deviation of replication timing at these loci, measured in 250 kb windows, is comparable to that observed across the >350 VERT regions detected at non-imprinted sites. For comparison, we have marked the imprinted regions with + symbols in Figures 1e, 2d, 3c, 6g, 7e, and 7g, and we have highlighted them in Supplemental Table 1 and in the Data Source files. For additional comparison, we have included Supplemental Figure 1 to illustrate the magnitude of replication timing imbalance and allele-specific gene expression at two autosomal imprinted regions.

      Reviewer 1:

      Figures and text would benefit from improved clarity: axis labels are missing in places (e.g., Fig. 1c, Fig. 2g), legends should explain chromosome arm colors, and cluttered figures such as Fig. 1j could be re-visualized for interpretability.

      Figure labels have been added to Figs. 1c and 2g, and legends modified for clarity.

      Reviewer 1:

      “…the claim of cell-type specificity is not convincingly demonstrated given the small sample size (n=4) and strong batch confounding between lymphoblastoid and cartilage progenitors.” And “Hierarchical clustering is confounded by batch and based on presence/absence calls that lack quantitative resolution.”

      We agree that the limited number of individuals and clones, as well as the comparison between only two distinct tissue types (LCLs and ACPs), have quantitative limitations. Our primary intent was to evaluate whether any I/SCs were shared between independently derived clonal cell lines from different tissues to determine whether there is evidence of tissue-specific I/SC usage, rather than to make quantitative claims about global cell-type specificity.

      To address this concern, we have replaced the hierarchical clustering analysis, in Figure 1i, with a Venn diagram that more directly illustrates the overlap and tissue-specific distribution of VERT regions detected in the different clonal sets. This revised representation avoids assumptions about clustering relationships and removes batch-driven bias, while still conveying the key observation that many VERT regions are shared across tissues and others appear tissue-restricted.

      Reviewer 1:

      While syntenic VERT regions across mouse and human are intriguing, they complicate interpretation of strong clustering by cell type. Sampling depth may also have exaggerated allelic imbalance calls.

      We note that the human LCLs used in our study are B cells, and immunoglobulin gene rearrangements were used to confirm the clonal uniqueness of each line. Similarly, the mouse replication timing data analyzed here was generated from pre-B cells, which also undergo immunoglobulin gene rearrangements. Thus, both the human LCL and mouse pre-B cell datasets were derived from B-cell lineages, providing a consistent cellular context for comparative analysis.

      Sequencing depth is an important consideration for all variant base calls. Without fully haplotype-resolved genomes, previous studies relied on calculating per-SNP calls of allelic imbalance based on reads covering a single nucleotide locus. To improve the sequencing depth supporting the identification of VERT and AEI regions, we utilized haplotype-resolved genomes that allowed all informative allele-specific reads to be pooled across all heterozygous SNPs within genomic windows or expressed genes. For AEI, we required a minimum of 20 informative allele-specific reads per gene, an FDR-corrected p-value of ≤0.05, and a minimum allelic imbalance of 80% vs 20%. Importantly, a recent study showed that allelic imbalance as low as 65%/35% is clinically relevant in humans (Nature 2025; 637:1186–1197). We reiterate that results for more stringent thresholds (>90% and >95% imbalance) are presented in Supplementary Table 2.
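
      As an illustration only, and not the authors' code, the three AEI criteria stated above (at least 20 pooled informative reads, an FDR-corrected p-value of at most 0.05, and at least 80%/20% imbalance) can be sketched as follows. The choice of an exact two-sided binomial test against a 50/50 expectation, the Benjamini-Hochberg correction, the input layout, and all function and gene names are assumptions made for this sketch; the response does not name the statistical test used.

```python
from math import comb

MIN_READS = 20             # minimum pooled informative reads per gene
MAX_FDR = 0.05             # FDR-corrected p-value cutoff
MIN_MAJOR_FRACTION = 0.80  # 80% vs 20% imbalance threshold


def binomial_two_sided_p(k, n):
    """Exact two-sided binomial p-value for k of n reads under p = 0.5 (assumed test)."""
    # The null distribution is symmetric, so double the smaller tail.
    tail = min(k, n - k)
    lower = sum(comb(n, i) for i in range(tail + 1))
    return min(1.0, 2 * lower / 2**n)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    for rank in range(n, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


def call_aei(snp_counts):
    """Return {gene: True/False} for genes with enough pooled reads.

    snp_counts maps a gene to a list of (haplotype1_reads, haplotype2_reads)
    tuples, one per heterozygous SNP (hypothetical layout).
    """
    pooled = {}
    for gene, per_snp in snp_counts.items():
        # Pool informative reads across all heterozygous SNPs in the gene.
        hap1 = sum(a for a, _ in per_snp)
        hap2 = sum(b for _, b in per_snp)
        if hap1 + hap2 >= MIN_READS:
            pooled[gene] = (hap1, hap2)
    genes = list(pooled)
    pvals = [binomial_two_sided_p(pooled[g][0], sum(pooled[g])) for g in genes]
    qvals = benjamini_hochberg(pvals)
    calls = {}
    for gene, q in zip(genes, qvals):
        hap1, hap2 = pooled[gene]
        major = max(hap1, hap2) / (hap1 + hap2)
        calls[gene] = q <= MAX_FDR and major >= MIN_MAJOR_FRACTION
    return calls
```

      Pooling reads per gene before testing is what lets a gene with several modestly covered SNPs pass the 20-read minimum, which a per-SNP analysis might not.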

      Reviewer 1:

      Gene set enrichment analysis should be restricted to avoid inflated significance from overly broad categories.

      Reviewer 2:

      Some of the GO terms presented are too broad to suggest any biological significance to the result, even if there is statistical significance (for example, the top term for LCL clones 'Cytoplasm' is associated with 12,000 genes, and the second term for mouse clones 'Membrane' is associated with 10,000). It would be helpful to focus on GO terms lower in the GO hierarchy.

      We now include our complete Gene Ontology analysis, with more specific biological categories, in Supplemental Table 5.

      Reviewer 2:

      Allelic imbalance has been referred to as AI, MAE (monoallelic expression), RMAE (random monoallelic expression) etc. The paper whose mouse data the authors make use of uses Asynchronous Stochastic Replication Timing (ASRT) instead of VERT to refer to the same phenomenon. Creating unnecessary jargon makes the paper more difficult to read and adds needless complexity to an already complex field.

      While we agree that allelic expression imbalance has been described by different investigators using many different phrases, we believe that MAE, RMAE and AI do not represent an accurate description of the phenomenon. In our study [and our previous study; Nat Commun. 2022; 13(1):6301] we used clonal analysis of allele-specific expression and found that while some clones display equivalent levels of expression between alleles of a given gene (i.e. bi-allelic expression), other clones express only one allele (i.e. mono-allelic expression), and yet other clones have undetectable expression (i.e. silent on both alleles). This pattern of allele-restricted expression indicates that each allele independently adopts either an expressed or silent state. Importantly, because these expression states are mitotically stable, allele-autonomous, and independent of parental origin, we refer to the choice of the expressed allele as stochastic. Given this variability, we believe that the phrase “Allelic Expression Imbalance” (AEI) represents a more accurate descriptor for this phenomenon. We also point out that “Allelic Expression Imbalance” has been used by other investigators >120 times in the PubMed database.

      In addition, the replication asynchrony that exists at these loci is not consistent with purely ASynchronous Replication Timing (ASRT) between alleles. We found that each allele can independently adopt either earlier or later replication timing in different clones. This variability results in some clones exhibiting pronounced asynchrony between alleles, while in others, the two alleles replicate synchronously, with both adopting either the earlier or later timing state. As reported in our previous study (Nat. Commun. 2022; 13:6301), this behavior reflects a stochastic and allele-autonomous process, leading us to describe these loci as exhibiting Variable Epigenetic Replication Timing (VERT), which we believe is a more accurate descriptor of this phenomenon.

      Reviewer 2:

      The point that allelic imbalance is enriched in VERTs would be enhanced if the authors could present the allelic ratio for all genes found in all VERTs, demonstrating how replication timing on either chromosome affects the allelic ratio.

      The stochastic nature of allelic expression and replication timing observed at VERT loci indicates that each allele independently acquires its epigenetic state. In addition, there is typically more than one transcription unit, both protein coding and non-coding, within each VERT region, and each transcription unit also acquires its expressed or silent state independently. Therefore, the expressed or silent status of one allele of a transcription unit does not predict the replication timing or expression status of the same or opposite allele of any other transcription unit within the VERT region. Accordingly, the Early/Late pattern of replication timing that we detect, both in this study and in our previous work (Nat. Commun. 2022; 13:6301), is not correlated with which allele is transcriptionally active. This supports our conclusion that asynchronous replication timing is not a downstream consequence of monoallelic transcription, but rather an independent epigenetic feature of I/SCs. Regardless, because each transcription unit is independent, we provide the expression ratios for all transcripts that are generated from the VERT regions for the coding and non-coding transcription units in Figures 1, 2, and 6; shown in Supplemental Table 9. This analysis indicated that 4,017 informative reads were derived from the earlier replication allele and 3,161 informative reads were derived from the later replication allele, generating an allelic ratio of 1.3 (early/late) and a binomial P value of 1.0.

      In addition, a similar analysis of imprinted loci reveals that even at genomic regions with parent-of-origin–specific expression, the replication timing of each allele does not align with transcriptional activity, i.e. both early- and late-replicating alleles can be transcriptionally active, depending on the gene. This observation is consistent with the complex organization of many imprinted domains, where genes on opposite alleles exhibit reciprocal expression patterns. To illustrate this point, we now include Supplemental Figure 1 demonstrating that imprinted loci harbor genes expressed from both the earlier- and later-replicating alleles. In addition, quantification of the total number of transcripts at the DLK1/MEG8 imprinted locus (Supplementary Figure 1a-1c) indicates that the ratio of transcripts derived from the early versus late replicating alleles is equivalent (i.e. a ratio of 1.0; See Supplemental Table 9).

      Reviewer 2:

      Figure 3 highlights the association of related gene clusters with VERTs but the VERTs are assigned based on variable replication timing in just 1 or 2 clones. This is an interesting observation, but to make the point that "VERT regions frequently coincide with gene clusters in the human genome" there needs to be a systematic assessment of replication timing at all gene clusters across all clones, and a statistical test for significance.

      Our intent in Figure 3 was not to suggest that all gene clusters are subject to VERT and AEI, but rather to highlight that several well-characterized multigene families that are known to exhibit AEI, such as olfactory receptor, protocadherin, and HLA gene clusters, coincide with VERT regions at their genomic locations. These examples serve as representative illustrations demonstrating that I/SC-associated regulation occurs at established AEI loci organized in gene clusters.

      To clarify this point, we have revised the text to explicitly state that Figure 3 presents illustrative examples of known AEI-associated gene clusters overlapping with VERT regions, rather than a comprehensive or statistically exhaustive analysis of all gene clusters across the genome.

      Reviewer 2:

      It is an interesting hypothesis that VERTs are conserved between species at syntenic loci. If such regions are really conserved, one would expect that replication timing at these sites would be consistently asynchronous. However, the data presented show that in human clones these VERTs can be specific to an individual donor (as in 5A) or an individual clone (as in 5H).

      As discussed in our Limitations Section, our analysis was restricted to a limited number of cell types, clones, and individuals, which may not capture the full diversity of I/SC usage across tissues and populations. While our dataset was sufficient to identify robust patterns of AEI and VERT, it likely represents only a subset of the broader landscape of I/SC regulation in both humans and mice. We anticipate that future studies incorporating a wider range of tissues, individuals, and clonal analyses will uncover an even greater degree of conservation and diversity in I/SC usage across genomes.

      Reviewer 2:

      In order to support the claim that neurodevelopmental disease associated genes reside in asynchronously replicating regions, and are thus more prone to allelic imbalance, the authors would need to demonstrate this phenomenon in neuronal cells.

      We make two points that address this critique: First, many of the neurodevelopmental disease genes located within or adjacent to VERT regions are not exclusively expressed in neuronal cells and have previously been shown to exhibit AEI in non-neuronal contexts. For example, Gimelbrant and Chess (Science, 2007; 318:1136–1140) demonstrated AEI of the Parkinson disease genes SNCA and LRRK2 in lymphoblastoid cell lines (LCLs), and in our previous study, we detected AEI of DNAJC6, another Parkinson disease gene, also in LCL cells (Nat. Commun. 2022; 13:6301). In the present study, using cartilage progenitor cells, we identified VERT and AEI of several epilepsy-associated genes, including SCN1A, SCN2A (Fig. 6b), GABRA1 (Fig. 6e), and SAMD12 (Fig. 6j), as well as a gene implicated in autism and neurodevelopmental disorders, SEMA5A (Fig. 5c), indicating that these genes are not exclusive to neuronal cell types.

      Second, independent studies from the laboratory of Dr. E. Heard have provided further evidence that AEI occurs in neuronal lineages. Using mouse neural progenitor cells (NPCs), they identified genes subject to AEI (Dev. Cell, 2014; 28:366–380) and they later evaluated AEI of syntenic human neurodevelopmental disease genes, including Snca, App, Eya4, and Grik2 (Nat. Commun. 2021; 12:5330). In addition, and consistent with our use of AEI, they used the phrase “Allelic Expression Imbalance” to describe the epigenetic expression biases at these genes.

      Together, these findings reinforce that AEI, and by extension I/SC regulation, is not restricted to specific cell types, but rather represents a generalizable mechanism of stochastic epigenetic regulation that includes genes relevant to neurodevelopment and disease.

      Reviewer 2:

      However, the authors consistently lean on thin evidence (i.e. a single clone) within a modestly sized dataset (4 clones from 2 donors each) to propose a new model for haploinsufficiency in human disease. The consistent focus on limited elements in the data and perhaps an overreach in the interpretation makes it difficult to appreciate what is in fact a very good experiment.

      We agree that our analysis was conducted on a modest number of clones and individuals, which we explicitly acknowledge as a limitation of the present study. However, several key points support the robustness and broader relevance of our conclusions:

      i. Clonal Design and Replication: The strength of our approach lies in its clonal resolution. Each clone represents a single-cell–derived population expanded to over a million cells, enabling direct detection of stable, mitotically heritable allele-specific epigenetic states that would not be apparent in population-averaged data. Importantly, many of the VERT regions we identified are shared between independent clones from different donors and across distinct cell types (ACP and LCL), demonstrating reproducibility and biological consistency.

      ii. Cross-Species Validation: We further identified syntenic VERT regions in mouse pre-B cell clones, including at loci known to exhibit AEI in prior studies, providing independent validation and evolutionary conservation of the phenomenon.

      iii. Integration with Published Evidence: Our findings extend prior observations of AEI and VERT (e.g. Gimelbrant et al. Science 2007; Heskett et al. Nat. Commun. 2022) and are fully consistent with known stochastic allelic expression imbalance of autosomal genes. We also draw parallels with the absence of cellular selection mechanisms that dictate dominant inheritance patterns for loss of function alleles for X linked disease genes (reviewed in: J Clin Invest, 2008, 20-23; and Nat Rev Genet. 2025, 26, 571–580). Our proposed model linking I/SC regulation to haploinsufficiency is therefore a synthesis of our results with an extensive body of published data, not an inference drawn from isolated observations.

      iv. Scope and Framing: We have revised the manuscript to clarify that our proposed model represents a mechanistic framework, not a definitive or exclusive explanation, for how stochastic allelic regulation could contribute to dosage-sensitive disease phenotypes. We also explicitly discuss the need for larger datasets and additional tissues to refine and test this model.

      In summary, while we recognize the limited sampling depth inherent to clonal analyses, the consistency of our observations across donors, cell types, and species, together with prior corroborating studies, supports the validity of the conclusions and justifies the broader conceptual implications.

    1. eLife Assessment

      This important study highlights how cell size influences various cellular responses, with a particular focus on ferroptosis. The evidence presented is convincing, employing multiple model systems and experimental approaches to support the conclusions. This work will be of significant interest to the fields of cell size, ferroptosis, and cancer biology.

      [Editors' note: this paper was reviewed by Review Commons.]

    2. Reviewer #1 (Public review):

      Summary:

      The study by Zatulovskiy et al. examined how cell size influences cell susceptibility to ferroptosis. The authors found a size dependence specifically for ferroptosis-inducing drug Era2, but not for other drugs. Using various human cell lines (HMEC, HT 1080, RPE 1), the authors generated populations of small and large G1 cells by FACS, CDK4/6 inhibition (palbociclib), or inducible cyclin D1 knockdown, and measured cell susceptibility to ferroptosis. Larger cells were more resistant than smaller cells. Mechanistically, larger cells showed reduced plasma membrane lipid peroxidation, higher glutathione concentrations, and changes in relevant cellular proteins levels, as analyzed using previously published data. Deleting ACSL4, which is involved in ferroptosis, partly eliminated the size dependence of ferroptosis. The work concludes that cell size is a key determinant of ferroptosis susceptibility. Overall, this work expands our understanding of how cell size is correlated with functional properties of cells, which can have implications for biomedical sciences.

      Strengths:

      The study establishes a credible link between cell size and susceptibility to ferroptosis, as induced by Era2. Experimental replication is sufficient, and key conclusions rely on data from multiple cell lines and on multiple approaches to manipulate cell size. This suggests that the conceptual findings made in this paper could reflect a more fundamental feature of mammalian cells. In addition, this work provides an interesting contrast to another recent study about size-dependency of ferroptosis (https://doi.org/10.1016/j.isci.2025.112363), where increased cell size heightened sensitivity to the GPX4 inhibitor RSL3.

      Original Weaknesses:

      Disentangling cell size effects from other confounding factors, such as the cell cycle or overall metabolic rate, is challenging, and the authors have managed to qualitatively prove that cell size influences Era2-induced ferroptosis. However, the quantitative nature of this link between cell size and susceptibility to ferroptosis remains somewhat unclear due to the confounding factors that are present in many of their experiments. Notably, the quantitative nature of this link could also depend on cell type and growth conditions, which remains to be investigated in detail. It should also be noted that this work focused on cell culture studies, and it remains unclear how much the findings of this paper could influence therapeutic strategies in vivo.

      Comments on revised version:

      I would first like to emphasize that I find this work solid, and I think the authors have done good work with the revisions.

      My only remaining recommendation is that the authors aim to more carefully examine the magnitude of the observed cell size-dependency in ferroptosis susceptibility. Their manuscript contains several experiments where the quantitative nature of this link remains unclear due to confounding factors, such as the cell cycle. For example, in Fig 2B&C, it seems that accumulation of cells in G1 (from ~60% to ~95%) decreases ferroptosis equally to the effect caused by cell volume doubling (from day 2 to day 4 of palbo treatment), suggesting that cell cycle has a much more pronounced effect on ferroptosis than cell size (especially when considering the size change from day 0 to day 2). However, the magnitude of the cell size effect is not consistent between all experiments shown. This is not surprising, as the authors use different approaches to changing cell size and different cell lines, but it makes the work more qualitative than quantitative. Notably, another confounding factor is the cell's metabolic/biosynthetic rate. It seems reasonable to assume that prolonged palbociclib treatment will decrease metabolic and protein synthesis rates (normalized to cell size), and this could make the cells less susceptible to ferroptosis. The rapamycin treatment results shown by the authors also support this notion. One approach to examining this could be to grow cells in various growth conditions to manipulate their growth & metabolic rate.

    3. Reviewer #2 (Public review):

      Summary:

      The authors set out to understand how cell phenotypes differ depending on the size of the cell, specifically here how cell size affects cell death. Using human cell lines (HMEC, HT-1080, RPE-1), the authors examined cell size through FACS sorting, CDK4/6 inhibition and inducible cyclin D1 knockdown. They identify that larger cells are more resistant to ferroptosis induced by system xc⁻ inhibition (erastin2), but more sensitive to GPX4 inhibition (RSL3), highlighting pathway-specific size dependencies.

      Mechanistically, larger cells exhibited:

      - Higher glutathione levels, supporting lipid peroxide detoxification

      - Increased ferritin expression, promoting iron sequestration

      - Lower ACSL4 levels, reducing incorporation of peroxidation-prone lipids

      The findings are supported by high-throughput microscopy, flow cytometry (BODIPY-C11 lipid peroxidation assays), and proteomic analyses. The study concludes that cell size influences proteome composition and metabolic capacity, thereby shaping cell death decisions, an insight with implications for aging, cancer, and ferroptosis-based therapies.

      Major Strengths:

      - use of multiple cell lines to validate their findings

      - use of multiple, complementary approaches

      - well designed screen and experiments throughout

      - clearly written, logical flow and easy to follow

      - relevance for multiple fields

      Weaknesses:

      - Lack of in-depth mechanistic investigation

      - Experiments are all in vitro and so, as yet, it is uncertain what the in vivo consequence would be

      General Assessment:

      This study presents a mechanistic link between cell size and ferroptosis susceptibility. Using high-throughput microscopy, proteomics, and genetic perturbations across multiple human cell lines, the authors demonstrate that larger cells are more resistant to ferroptosis induced by system xc⁻ inhibition (erastin2). This resistance is attributed to elevated glutathione production, increased ferritin-mediated iron sequestration, and reduced ACSL4-dependent lipid peroxidation. The experimental design is rigorous and multifaceted, with consistent results across cell types and size manipulation methods. While the study is limited to in vitro systems, its conceptual and mechanistic insights lay the groundwork for future in vivo and translational investigations.

      Advance:

      This work is the first to systematically show that cell size directly influences ferroptosis susceptibility via proteome scaling. It reconciles previous findings that large cells are sensitized to GPX4 inhibition (RSL3) by demonstrating that the ferroptosis pathway targeted (system xc⁻ vs GPX4) determines the direction of size-dependent vulnerability. The study provides a conceptual advance by positioning cell size as a regulatory axis in cell death decisions, and a mechanistic advance by identifying size-dependent changes in glutathione metabolism, ferritin levels, and ACSL4 expression.

      Audience:

      This research will be of interest to specialists in cell death, ferroptosis, redox biology, and cancer biology. It also holds relevance for aging researchers and translational scientists exploring ferroptosis-based therapies. The findings may influence how cell size heterogeneity is considered in therapeutic design, particularly in oncology and senescence-targeting strategies.

      Comments on revised version:

      We have no additional comments after revision. Thank you for addressing our initial queries.

    4. Reviewer #3 (Public review):

      In this manuscript, Zatulovskiy and colleagues elaborate on their previous work describing cell size-dependent changes in the proteome by investigating whether these changes can be correlated with differences in cell physiology. Using a cleverly designed high-throughput screen, they searched for compounds to which differently sized cells display differential sensitivity. Their primary hit, Era2, is involved in the ferroptosis pathway and serves as the starting point for a detailed study of how excess cell size protects cells from ferroptosis-induced cell death via: 1) lower concentrations of ACSL4 (which produces peroxidation-prone PUFAs), 2) increased ferritin concentrations, and 3) increased GSH concentrations.

      Overall, the experiments in this manuscript are well-designed and interpreted. It is an extremely well-written manuscript with a clear trajectory of logic.

      Comments on the revised version:

      The authors have addressed my original concerns adequately. I do not need to see it again, if there are further revisions.

    5. Author response:

      General Statements

      We were pleased to see that all three reviewers support publication after revision. No one questions the premise that cell size influences ferroptosis susceptibility. The main concerns fall into two categories: (A) disentangling “Cell size vs cell cycle”, which is the biggest issue for Reviewer #1 and partially for Reviewer #3; and (B) additional mechanistic tests, including SLC7A11 and ferritin functional tests (Reviewer #2), lysosomal iron measurements (via LysoRhoNox), and some further ACSL4 experiments (Reviewer #3). Other reviewer concerns are more minor.

      In our revision, we have addressed the reviewers’ specific criticisms with additional experiments as described below. We believe the constructive feedback from peer review helped us to significantly extend our mechanistic findings and strengthen the manuscript through revision.

      Point-by-point description of the revisions

      Reviewer #1:

      Summary:

      The study by Zatulovskiy et al. examined how cell size influences cell susceptibility to ferroptosis. The authors found a size dependence specifically for ferroptosis-inducing drug Era2, but not for other drugs. Using various human cell lines (HMEC, HT 1080, RPE 1), the authors generated populations of small and large G1 cells by FACS, CDK4/6 inhibition (palbociclib), or inducible cyclin D1 knockdown, and measured cell susceptibility to ferroptosis. Larger cells were more resistant than smaller cells. Mechanistically, larger cells showed reduced plasma membrane lipid peroxidation, higher glutathione concentrations, and changes in relevant cellular proteins levels, as analyzed using previously published data. Deleting ACSL4, which is involved in ferroptosis, partly eliminated the size dependence of ferroptosis. The work concludes that cell size is a key determinant of ferroptosis susceptibility.

      My major concerns about this work focus on whether many of the results reflect cell size or cell cycle effects, and whether the FACS-based size-scaling analyses have some misleading features to their design & presentation. If these concerns can be addressed with new experiments, then the conclusions of this paper are justified. If these concerns cannot be addressed, then the authors should more directly acknowledge the alternative hypothesis that cell cycle effects may explain many of their results.

      The experiments seem to be replicated sufficiently, and most conclusions rely on data from multiple cell lines. My minor comments focus on needs to provide statistics and method details, and on suggestions on how to improve text clarity, but these edits are easily done and don't require new experiments. Overall, this is an interesting study, and it should be published once the concerns below are addressed.

      Major comments:

      In experiments reported in Fig 1 and 2A, the authors sort small and large cells in G1, plate them, and later start the drug treatments & cell monitoring. Are these cells actively cycling (progressing in the cell cycle), and how fast? The large cells are likely to enter S phase earlier than the small cells, so by the time that the authors start their drug treatments, they may be comparing cells in different cell cycle stages, which could influence drug sensitivity more than cell size (as the authors also suggest later in Fig 2). This needs to be controlled for.

      Furthermore, even if the cells remain in G1 after sorting until the drug treatments are started, the authors should address the fact that the drugs are present for a long time, thus targeting the cells in various cell cycle stages.

      We agree with the reviewer that the cell cycle stage could affect ferroptosis susceptibility and could be a confounding effect in asynchronous cells. One of us (Dixon) reported the cell cycle effects on ferroptosis previously, and we observe them in this manuscript too (Fig. 2B,C,E). We now state this more clearly both in the Results and in the Discussion sections, where we write:

      Line 159: “We note that non-arrested cells had a lower susceptibility to Era2-induced ferroptosis compared to cells that were arrested in G1 for 2-3 days, despite being smaller in size. This is likely due to the difference in the fraction of cells in different cell cycle phases between arrested and non-arrested conditions since cells in S/G2/M phases are known to be more resistant to ferroptosis than cells in G0/G1 phases (Rodencal et al, 2024; Kuganesan et al, 2023)”

      Line 533: “Cells in G1 phase of the cell cycle were reported to be more susceptible to ferroptosis (Rodencal et al, 2024; Kuganesan et al, 2023), which suggested that ferroptosis inducers could be used in combination with cancer drugs, like the CDK4/6 inhibitor palbociclib, that arrest cells in G1 phase of the cell cycle (Herrera-Abreu et al, 2024). However, while CDK4/6 inhibitors arrest cells in G1, they do not inhibit cell growth, such that the longer they are arrested, the larger the cells grow (Lanz et al, 2022; Crozier et al, 2023; Manohar et al, 2023). This results in a complex, nonmonotonic ferroptotic response dynamics in cells treated with CDK4/6 inhibitors (Fig. 2B,E). Just following CDK4/6 inhibitor treatment, as more and more cells are arrested in G1 phase, cells become more sensitive to both RSL3- and erastin-induced ferroptosis (Kuganesan et al, 2023; Rodencal et al, 2024). However, the longer the cells are arrested, the larger they become, which further promotes their susceptibility to RSL3 (Fig. S1B) but reduces their susceptibility to Era2-induced ferroptosis (Fig. 2B). The fact that the cell cycle arrest and cell size increase have opposing effects on Era2-induced ferroptosis susceptibility could explain why different studies reported seemingly contradictory results, where sometimes an increased and sometimes a decreased or unchanged sensitivity to system x<sub>c</sub><sup>-</sup> inhibitors was observed depending on the cell type, duration and type of cell cycle arrest (Lee et al, 2024; Kuganesan et al, 2023; Rodencal et al, 2024). Such complex interplay between the cell cycle and cell size effects on ferroptosis suggests that combination therapies utilizing CDK4/6 inhibitors and ferroptosis inducers would have to carefully choose a dosage schedule.”

      Given the potentially confounding effects of the cell cycle in cycling cells sorted by size, we performed an additional experiment, in which RPE-1 cells were pre-treated with the CDK4/6 inhibitor palbociclib to synchronize them in G1 phase prior to treatment. These cells were then continuously exposed to palbociclib during the Era2 treatment (Fig. 2C-E). RPE-1 cells pretreated with palbociclib for 2 and 4 days had the same cell cycle distribution with 94% of cells being arrested in G1, but with different sizes. Cells treated with palbociclib for 4 days were significantly larger and more resistant to Era2.

      Additionally, in the experiment shown in Fig. 5E,F, where we FACS-sorted WT and ACSL4 KO HMEC cells by cell size, and then measured Era2 susceptibility, we pre-treated the cells with palbociclib for 24 h to synchronize them in G1 prior to the sorting. We then cultured the cells in the presence of palbociclib during the Era2 treatment to avoid the cell cycle effects observed in Fig. 2. In this case, we still observe that larger cells are more resistant to Era2, consistent with our conclusion that cell size protects against Era2-induced ferroptosis.

      Can the G1 arrest-driven changes in drug susceptibility (Fig 2 C-D) be attributed to cell size? Can the authors rescue the palbociclib treatment with rapamycin or other growth inhibitors that allow size to remain small during G1 arrest?

      We attempted these experiments, but when we co-treated the cells with palbociclib and mTORC inhibitors, we observed variable results. This is likely because prolonged mTORC inhibition itself rewires cellular metabolism and reduces cell susceptibility to ferroptosis, as one of us (Dixon) found previously (Armenta et al. (2022), Ferroptosis inhibition by lysosome-dependent catabolism of extracellular protein. Cell Chemical Biology 29: 1588-1600.e7). Our results were consistent with this previous report and are now included in a new supporting figure panel (Fig. S3C).

      Thus, upon palbociclib+rapamycin co-treatment there seems to be a competition between cell size-mediated and metabolism-mediated effects of mTORC inhibition on ferroptosis, which leads to variable outcomes.

      In Fig 2E-F, is the cell cycle distribution of the samples influenced by CCND1 shRNA induction? Are the drug sensitivity effects due to cell size or cell cycle changes?

      The CCND1 manipulation model is extensively characterized in our recent work cited in this manuscript (You et al. (2025), Cell size-dependent mRNA transcription drives proteome remodeling. 2025.10.30.685141 doi:10.1101/2025.10.30.685141). Indeed, CCND1 shRNA cells have a slightly elongated G1 phase due to a ~30% reduction in Cyclin D1 concentration: the G1 fraction changes from ~70% in wild-type to ~80% in CCND1 shRNA cells, which could potentially affect the ferroptosis susceptibility, but the additional results obtained on synchronized RPE-1 cells, described above (Fig. 2C-E), support the conclusion that the primary effect on Era2 sensitivity is due to cell size.

      Can the authors address the meaningfulness of the FACS-based size-scaling results in cases where cell-to-cell variability is very large? For example, in Fig 4D&G, the results are so variable even in identically sized cells that the importance of the size-scaling pattern seems questionable.

      We do observe variability in fluorescent probe-based measurements of GSH and lipid oxidation, which could be due to biological (natural cell heterogeneity) and/or technical (low sensitivity of the probes) reasons. However, when we look at binned data and compare the mean values ± s.e.m. for each bin, we observe a robust and reproducible trend (black line with dark-grey shaded area), even though the SD is quite broad (lighter shaded area). We believe such trends are meaningful when describing cell death in probabilistic terms as we do. I.e., the GSH measurement might not be precise enough to predict cell death for a given individual cell, but the statistical trend is clear and these measurements help predict cell death probabilities for cells of different sizes.
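
      As a rough illustration of why a binned mean can be tightly constrained even when the per-cell spread is broad, the standard error of the mean shrinks with the square root of the number of cells in a bin. The symbols below are introduced here only for illustration: n_b is the number of cells in bin b, and the 200-cell minimum is the per-bin threshold stated in the Methods text quoted later in this response.

```latex
\mathrm{SEM}_b = \frac{\mathrm{SD}_b}{\sqrt{n_b}},
\qquad
n_b \ge 200 \;\Rightarrow\; \mathrm{SEM}_b \le \frac{\mathrm{SD}_b}{\sqrt{200}} \approx \frac{\mathrm{SD}_b}{14.1}
```

      Under this relation, a bin that passes the 200-cell filter has an uncertainty on its mean at least about 14-fold smaller than the single-cell standard deviation, which is consistent with a narrow s.e.m. band sitting inside a wide SD band.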

      In Figs 4B-D, the cell size axis seems to have over 4-fold size variability, but when the authors show the analysis of this data (Figs 4E-G) the variability is only 2-fold. What was excluded and on what basis?

      To address this point, we have now clarified in the Methods section how the data were processed and what data points we excluded from this analysis:

      Line 671: “For all binned flow cytometry data plots, the cells below the 2nd and above the 98th cell size percentiles were excluded to remove the extreme outliers. Then, the remaining data were binned by size and plotted as background-corrected average fluorescence intensity for each bin against the bin’s average cell size. Bins with fewer than 200 cells were excluded from the analysis to reduce noise.”

      Typically, such pre-processing reduces the size range, mostly from the large-cell end, because of the long right tail of the size distribution containing a few very large cells.
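
      The pre-processing described above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the authors' analysis code: the function names, the use of equal-width size bins, the number of bins, and the treatment of background as a single scalar subtracted from every cell are all assumptions made here for the example. Only the 2nd/98th percentile trimming and the 200-cell minimum per bin come from the quoted Methods text.

```python
from math import sqrt
from statistics import mean, stdev


def percentile(sorted_vals, q):
    """Linearly interpolated percentile (q in 0..100) of a pre-sorted list."""
    if not sorted_vals:
        raise ValueError("need at least one value")
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1.0 - frac) + sorted_vals[hi] * frac


def bin_by_size(sizes, signals, background=0.0, n_bins=10,
                min_cells=200, lo_pct=2.0, hi_pct=98.0):
    """Trim size outliers, bin cells by size, and summarize each bin.

    Assumptions (hypothetical, for illustration only): equal-width bins
    and a scalar background value subtracted from each cell's signal.
    """
    if len(sizes) != len(signals):
        raise ValueError("sizes and signals must have the same length")
    ordered = sorted(sizes)
    lo = percentile(ordered, lo_pct)
    hi = percentile(ordered, hi_pct)
    if hi <= lo:
        raise ValueError("size range collapsed after trimming")

    # Keep only cells between the lower and upper size percentiles.
    kept = [(s, f - background) for s, f in zip(sizes, signals) if lo <= s <= hi]

    # Assign each remaining cell to an equal-width size bin.
    width = (hi - lo) / n_bins
    bins = [[] for _ in range(n_bins)]
    for s, f in kept:
        idx = min(int((s - lo) / width), n_bins - 1)
        bins[idx].append((s, f))

    rows = []
    for cells in bins:
        # Drop sparsely populated bins (stdev also needs at least 2 cells).
        if len(cells) < max(min_cells, 2):
            continue
        vals = [f for _, f in cells]
        sd = stdev(vals)
        rows.append({
            "n": len(cells),
            "mean_size": mean(s for s, _ in cells),
            "mean_signal": mean(vals),
            "sd": sd,
            "sem": sd / sqrt(len(vals)),
        })
    return rows
```

      Each returned row corresponds to one plotted point (bin mean size against bin mean background-corrected signal), with both the SD and the s.e.m. available for shading. Because the long right tail of a size distribution holds few cells, both the percentile trim and the minimum-count filter tend to remove data from the large-cell end first.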

      Based on the methods section & figure legends of Fig 4B-I, the RPE cells were not pre-sorted to include only G1 cells, nor did the assay account for cell cycle differences. How can these data be used to explain results from earlier figures, where analyses were exclusively focused on size differences in G1?

      This is a valid point: Cells in the GSH measurement experiment were not gated by Hoechst signal for G1 phase because the channel normally used for Hoechst staining was in this case occupied by the MCB probe. However, given the data in Fig. 4A,B showing that the GSH production machinery is superscaling when measured specifically in G1-phase cells, we believe the flow cytometry data in Fig. 4C-J showing GSH concentration increasing with cell size across the whole cell cycle is very likely true for G1 cells as well.

      Minor comments:

      I recommend clarifying in the early introduction that all size changes discussed are in the absence of DNA content increase.

      We have now clarified this in the introduction (Line 41 and Line 81).

      The introduction seems to cite primary research and review papers in the same sentences, which is a bit misleading as the reviews don't seem to add new evidence.

      We have removed review citations where they did not provide additional context.

      OPTIONAL

      In the second introduction paragraph, consider the classification/description of the three different mechanisms. Currently, it seems that these mechanisms are not independent of each other, and the details provided about each mechanism are inconsistent.

      We have now modified this paragraph to make the description more consistent.

      Please provide statistics for the IC50 values reported based on Fig 1C. Were small and large cells statistically different? Are the IC50 values reported as +/- standard deviation or some other metric?

      This has now been clarified in the text as follows:

      “For example, at the 72 h time point, the Era2 IC50 was 28 ± 11 µM (mean ± SD) for large cells versus 2.0 ± 1.4 µM for small cells (Student’s t-test: p = 0.039) (Fig. 1C).”

      Providing more insight into why Era2 and RSL3 treatments yield more opposite responses would be of great interest to the field.

      We agree this is an important point that should be discussed in more detail. In the field of ferroptosis, context-dependent (i.e., cell type-specific) effects are common and multiple groups including our own (Dixon) have published extensively on genes and mechanisms that can lead to differences between erastin2 and RSL3 sensitivity. For example, there are studies showing that the mTOR pathway or the p53 pathway can either prevent or promote ferroptosis, depending on the cell type and/or other currently unknown variables. To address more specifically the differences between Era2 and RSL3 in the context of our observed cell-size-dependent response, we have now added more data and discussion. In the Results section we added panel 4B and the following text:

      Line 359: “While the upregulation of GSH biosynthesis may promote the resistance of larger cells to ferroptosis, such an upregulation alone cannot explain why larger cells become more resistant to ferroptosis induced by the cystine import inhibitor Era2, but not, for example, by the GPX4 inhibitor RSL3 (Chan et al, 2025) (Figs. 2B, S1B). We found previously that upon mTORC1 inhibition cells can evade cystine deprivation-induced ferroptosis by uptake and catabolism of cysteine-rich extracellular proteins, mostly albumin (Armenta et al, 2022) (Fig. S3C). This process involves albumin degradation in lysosomes, predominantly by cathepsin B (CatB), and subsequent export of cystine from lysosomes to fuel the synthesis of glutathione. Large cells undergo proteome rearrangements similar to those occurring upon mTORC1 inhibition (Zatulovskiy et al, 2022). This suggests that large cells may upregulate CatB expression to bypass the Era2-induced cystine import inhibition via system xc-. To test this hypothesis, we used flow cytometry to measure how the expression of cathepsin B and the system xc- cystine/glutamate transporter SLC7A11 (xCT) scales with cell size (Fig. 4B). We found that SLC7A11 concentration modestly decreases, while CatB concentration significantly increases with cell size (Fig. 4B). This shift in the ratio between SLC7A11 and CatB supports the hypothesis that larger cells may rely less on cystine import via system xc- and thus become more resistant to system xc- inhibition by Era2.”

      Additionally, in the Discussion we added the following:

      Line 578: “We show that large cells may become resistant specifically to Era2 but not RSL3 through the upregulation of lysosomal function, particularly cathepsin B expression, which enables the uptake and catabolism of cysteine-rich extracellular proteins. A size-dependent shift in the ratio between SLC7A11 and cathepsin B makes large cells less dependent on cystine import via system xc-, and thus, more resistant to Era2. In addition to this, it was reported that RSL3 can induce ferroptosis independently of GPX4 and may target other selenoproteins (DeAngelo et al, 2025; Cheff et al, 2023), which could also contribute to the difference in size-dependent responses to RSL3 and Era2.”

      Is the BODIPY-C11 labeling specific to plasma membrane, as suggested by the writing of the authors, or do the results shown integrate signals over all cell membranes?

      We thank the reviewer for pointing this out. BODIPY-C11 581/591 stains many membranes in the cell, not just the plasma membrane. We have changed the wording in the manuscript to reflect this.

      How exactly is gating done for the flow cytometry samples? Especially when analyzing size-scaling, the results are likely to be sensitive to outliers, such as those seen in Fig 4C (a subpopulation of very low CFSE stained cells). Can the authors clarify their methods and/or display supplementary figures with gating examples?

      We have now specified our gating strategy in the Methods section (Line 663) and added a corresponding Supplementary Figure S5.

      In Fig 4, total protein staining was used as a control, whereas in Fig 5B beta-actin was used as a control. Why did the authors rely on different control approaches for essentially the same measurements? Are these controls comparable?

      In our flow cytometry experiments, we consistently use live-cell total protein stain (CFSE) for live cells, and anti-Tubulin immunofluorescent staining for fixed cells, both of which scale in proportion to cell volume and act as a read-out for total cellular protein content (Lanz and Zatulovskiy et al., Mol Cell 2022; Berenson et al. MBoC 2019), which we use to calculate concentrations of other cellular components (analogous to loading controls). In Fig. 5B, beta-Actin is used as a reference, a protein whose concentration does not change with cell size, as opposed to ACSL4 whose concentration decreases with cell size. In this plot, both ACSL4 and beta-Actin amounts were normalized to alpha-Tubulin, which is analogous to a concentration calculation using a loading control. This is now explained in more detail in the Figure legend.

      Reviewer #1 (Significance):

      I work in the cell size research field, and I am familiar with other related works in this field. My evaluation reflects a specialist's view of this study. Overall, this study will be of a large interest to a small group of specialists, and specific aspects of the work will also gain some interest from broader basic research audiences studying mechanisms of drug responses and ferroptosis in general. However, I do not see this work gaining very broad interest across larger audiences, simply because the field of cell size research is not of broad interest, and this is not a landmark study for the field.

      The field of cell size research has long searched for size-dependent functions, as these could help explain why cell size matters. This study is a nice addition to our field, helping establish ferroptosis as a size-dependent function. However, the significance of this work relies on how clearly the authors can establish that their results are cell size rather than cell cycle effects (see major comments above). Should the authors address these concerns, then this study will provide some conceptual and mechanistic insight.

      Regarding mechanistic insights, this work is in stark contrast to a recent study about size-dependency of ferroptosis (https://doi.org/10.1016/j.isci.2025.112363), where increased cell size heightened sensitivity to the GPX4 inhibitor RSL3, thus suggesting an opposite conclusion than what the authors observed with the drug Era2. The authors examined this contradiction, and while their results with the drug RSL3 agreed with the recent study, they did not explain why different drug mechanisms yield opposite results. Providing more insights into this discrepancy would increase the impact of this work.

      Regardless of the impact of this work, I want to emphasize that I am fully supportive of seeing this work published once the technical concerns have been addressed. Our field will benefit from this work, and this work could catalyze important future research. The general topic studied here has the potential to become very important.

      We thank the reviewer for their thoughtful assessment and for supporting publication pending resolution of the technical concerns. We respectfully disagree that our audience is likely narrow: Reviewer #2 noted broad relevance to specialists in cell death/ferroptosis, redox biology, cancer biology, aging, and translational efforts in ferroptosis-based therapies, and Reviewer #3 similarly emphasized both cell size and ferroptosis/cell death communities. We therefore believe the work will be of interest across multiple active fields, particularly because it highlights how cell size heterogeneity can shape drug response.

      We agree that the significance hinges on clearly distinguishing cell size from cell-cycle effects, and we have strengthened the corresponding controls/analyses and adjusted language accordingly (see responses to major comments above). We also addressed the reported discrepancy between Era2 and RSL3 size-dependencies by adding new data (Fig. 4B) and expanded discussion. We very much hope that the reviewer appreciates the efforts we have made to strengthen this manuscript and resolve the technical concerns. For these reasons, we believe this work will have an impact on several fields and gain a broad readership.

      Reviewer #2:

      Zatulovskiy et al. demonstrate that cell size modulates susceptibility to ferroptosis, a form of iron-dependent cell death driven by lipid peroxidation. Using human cell lines (HMEC, HT-1080, RPE-1), the authors examined cell size through FACS sorting, CDK4/6 inhibition and inducible cyclin D1 knockdown. They found that larger cells are more resistant to ferroptosis induced by system xc<sup>-</sup> inhibition (erastin2), but more sensitive to GPX4 inhibition (RSL3), highlighting pathway-specific size dependencies.

      Mechanistically, larger cells exhibited:

      - Higher glutathione levels, supporting lipid peroxide detoxification

      - Increased ferritin expression, promoting iron sequestration

      - Lower ACSL4 levels, reducing incorporation of peroxidation-prone lipids

      These findings were supported by high-throughput microscopy, flow cytometry (BODIPY-C11 lipid peroxidation assays), and proteomic analyses. The study concludes that cell size influences proteome composition and metabolic capacity, thereby shaping cell death decisions, an insight with implications for aging, cancer, and ferroptosis-based therapies.

      Major Comments

      (1) Direct evaluation of SLC7A11 abundance and function is needed

      The opposite size-dependent effects of erastin2 and RSL3 strongly suggest a role for SLC7A11/system xc<sup>-</sup> activity in size-dependent ferroptosis resistance. However, SLC7A11 levels were not quantified due to insufficient peptide detection in the proteomic data.

      - Direct measurement of SLC7A11 protein levels (immunoblotting or flow cytometry) in small vs large cells would test whether its expression scales with size.

      - Functional perturbation (siRNA/CRISPR knockdown) followed by erastin2 treatment would provide mechanistic validation.

      - Use of additional SLC7A11 inhibitors (e.g., sulfasalazine, sorafenib) could further test whether the size resistance phenotype is xc<sup>-</sup>-specific.

      We agree that the difference in size-dependent responses to RSL3 and Era2 is an important point that needs further investigation and discussion, as other reviewers also pointed out. To address more specifically the differences between Era2 and RSL3 in the context of cell-size-dependent response, we have now added more data and discussion. In the Results section we added panel 4B measuring SLC7A11 and Cathepsin B scaling with cell size and the following text:

      Line 359: “While the upregulation of GSH biosynthesis may promote the resistance of larger cells to ferroptosis, such an upregulation alone cannot explain why larger cells become more resistant to ferroptosis induced by the cystine import inhibitor Era2, but not, for example, by the GPX4 inhibitor RSL3 (Chan et al, 2025) (Figs. 2B, S1B). We found previously that upon mTORC1 inhibition cells can evade cystine deprivation-induced ferroptosis by uptake and catabolism of cysteine-rich extracellular proteins, mostly albumin (Armenta et al, 2022) (Fig. S3C). This process involves albumin degradation in lysosomes, predominantly by cathepsin B (CatB), and subsequent export of cystine from lysosomes to fuel the synthesis of glutathione. Large cells undergo proteome rearrangements similar to those occurring upon mTORC1 inhibition (Zatulovskiy et al, 2022). This suggests that large cells may upregulate CatB expression to bypass the Era2-induced cystine import inhibition via system xc-. To test this hypothesis, we used flow cytometry to measure how the expression of cathepsin B and the system xc- cystine/glutamate transporter SLC7A11 (xCT) scales with cell size (Fig. 4B). We found that SLC7A11 concentration modestly decreases, while CatB concentration significantly increases with cell size (Fig. 4B). This shift in the ratio between SLC7A11 and CatB supports the hypothesis that larger cells may rely less on cystine import via system xc- and thus become more resistant to system xc- inhibition by Era2.”

      Additionally, in the Discussion we added the following:

      Line 578: “We show that large cells may become resistant specifically to Era2 but not RSL3 through the upregulation of lysosomal function, particularly cathepsin B expression, which enables the uptake and catabolism of cysteine-rich extracellular proteins. A size-dependent shift in the ratio between SLC7A11 and cathepsin B makes large cells less dependent on cystine import via system xc<sup>-</sup>, and thus, more resistant to Era2. In addition to this, it was reported that RSL3 can induce ferroptosis independently of GPX4 and may target other selenoproteins (DeAngelo et al, 2025; Cheff et al, 2023), which could also contribute to the difference in size-dependent responses to RSL3 and Era2.”

      (2) Functional tests of ferritin contribution to resistance are needed

      Although elevated ferritin (FTH1/FTL) levels in larger cells represent a strong correlational signal, definitive experimental evidence establishing causality is currently lacking.

      - Measuring the labile iron pool directly in size-stratified populations would strengthen the link.

      - Knockdown of FTH1 or FTL could reveal whether ferritin upregulation is necessary for the resistance of large cells to ferroptosis.

      We thank the reviewer for raising this point. We have now completed additional experiments, as suggested by the reviewer, and found that iron chelation is unlikely to mediate the size-dependent response to Era2. We have modified the manuscript accordingly and added the following data and discussion to address this point:

      Line 296: “The observed increase in ferritin concentration with cell size could therefore lead to additional Fe2+ ion chelation, which in turn would protect large cells from iron-dependent lipid peroxidation and ferroptosis. However, when we measured the concentration of labile intracellular Fe2+ using a fluorescent probe FerroOrange (Hirayama et al, 2020), we did not observe any size-dependent decrease in labile iron concentration (Fig. S2A). Previous work suggests a link between increased sequestration of ferrous iron in lysosomes and resistance to ferroptosis. It was reported that senescent cells, which are also large (Fig. S3A,B), gain resistance to ferroptosis through lysosomal alkalinization and sequestration of ferrous iron in lysosomes (Loo et al, 2025). We therefore tested whether the superscaling of lysosomes observed in large cells (Lanz et al, 2022; You et al, 2025) promotes Era2 resistance through lysosomal iron sequestration. To do this, we stained the cells with the lysosomal iron detection probe Lyso-FerroRed (Saimoto et al, 2025) and measured its scaling using flow cytometry (Fig. S2B). We observed that the amount of Lyso-FerroRed, and therefore, the amount of lysosomal iron, scaled in direct proportion to cell size, just like the total cellular protein content (Fig. S2B). These results indicate that iron chelation by ferritin and its sequestration in lysosomes are unlikely to play a crucial role in size-dependent decrease in Era2 sensitivity.”
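
      Statements such as "scaled in direct proportion to cell size" or "superscaling" can be made concrete with a log-log fit of amount against size. The sketch below is one hedged way to do this and is not taken from the manuscript: the function names, the choice of a least-squares fit on log2-transformed values, and the tolerance used to call a slope "proportional" are assumptions introduced here for illustration. Under this convention a slope near 1 means the amount tracks size (constant concentration), a slope above 1 means the concentration rises with size, and a slope below 1 means it falls.

```python
from math import log2


def scaling_slope(sizes, amounts):
    """Least-squares slope of log2(amount) against log2(size).

    Inputs are per-bin (or per-cell) positive values of equal length.
    """
    if len(sizes) != len(amounts) or len(sizes) < 2:
        raise ValueError("need two or more paired observations")
    xs = [log2(s) for s in sizes]
    ys = [log2(a) for a in amounts]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("sizes must not all be identical")
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


def classify_scaling(slope, tolerance=0.1):
    """Label a slope; the tolerance is a hypothetical cutoff, not a published one."""
    if slope > 1.0 + tolerance:
        return "superscaling"
    if slope < 1.0 - tolerance:
        return "subscaling"
    return "proportional"
```

      For example, `classify_scaling(scaling_slope(bin_sizes, bin_amounts))` applied to per-bin mean sizes and mean probe intensities gives a single label per probe, which makes a comparison such as a lysosomal iron probe against total protein content easier to state quantitatively.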

      (3) Relevance to senescence should be addressed experimentally or explicitly discussed

      Given that senescent cells are enlarged and accumulate in aged and tumour tissues, testing senescent models for erastin2 resistance would greatly strengthen the physiological significance.

      We agree that an increase in cell size contributing to the resistance of senescent cells to ferroptosis is intriguing. We have now added a Supplementary Figure S3 and discussion of this point in the manuscript as follows:

      Discussion line 552: “…our data suggest that previously reported resistance of senescent cells to ferroptosis can at least partially be due to the increased cell size, a well-established hallmark of senescence.”

      Minor Comments

      (1) Mechanistic nuance regarding RSL3 should be included

      RSL3 has been reported to induce ferroptosis independently of GPX4 (PMID: 37087975, PMID: 40392234) and may target other selenoproteins such as TXNRD1. This nuance would help explain the observed divergence between RSL3 and erastin2 sensitivity across sizes.

      We have now added this in the Discussion as suggested by the reviewer (line 583):

      “In addition to this, it was reported that RSL3 can induce ferroptosis independently of GPX4 and may target other selenoproteins (DeAngelo et al, 2025; Cheff et al, 2023), which could also contribute to the difference in size-dependent responses to RSL3 and Era2.”

      (2) Dynamic range of BODIPY-C11 assays needs commentary

      Despite high erastin2 doses, the oxidized BODIPY signal remains close to DMSO levels. The authors should comment on whether this reflects high GSH buffering capacity, probe limitations, or other factors.

      We believe there are both technical (narrow dynamic range of the probe) and biological reasons for the relatively small (2-3 fold) difference in Oxidized-to-Non-oxidized BODIPY-C11 ratios between DMSO and Era2-treated cells. The biological reason is that the cells continue producing GSH until they fully deplete the cystine pool, which happens ~20-24 h after Era2 addition. Once the cystine pool is depleted, the cells very rapidly deplete GSH and initiate cell death. Therefore, there is only a short time window where cells are strongly depleted of GSH before dying. We see this small fraction of cells with a high Oxidized BODIPY-C11 signal in our flow cytometry experiments and in previous microscopy analysis of BODIPY-C11 (Murray et al., Protocol for detection of ferroptosis in cultured cells. STAR Protoc. 2023), but at our chosen time point (20h Era2) most cells are not as bright because we aimed to analyze the population before the onset of widespread cell death.

      (3) Western blot for shCycD1 depletion should be included

      CycD1 depletion usually causes cells to stop proliferating, which is not the case here. Therefore, depletion must be partial. The level of depletion should be shown by immunoblotting.

      The CCND1 manipulation model is extensively characterized in our recent work cited in this manuscript (You et al. (2025), Cell size-dependent mRNA transcription drives proteome remodeling. 2025.10.30.685141 doi:10.1101/2025.10.30.685141). CCND1 shRNA cells do not fully arrest in G0/G1 because the concentration of Cyclin D1 protein in this system is only partially decreased, as the reviewer noted. As a result, the cells have a slightly elongated G1 phase due to a ~30% reduction in Cyclin D1 concentration, but continue to proliferate. The G1 fraction changes from ~70% in wild-type to ~80% in CCND1 shRNA cells.

      Reviewer #2 (Significance):

      General Assessment: This study presents a mechanistic link between cell size and ferroptosis susceptibility. Using high-throughput microscopy, proteomics, and genetic perturbations across multiple human cell lines, the authors demonstrate that larger cells are more resistant to ferroptosis induced by system xc<sup>-</sup> inhibition (erastin2). This resistance is attributed to elevated glutathione production, increased ferritin-mediated iron sequestration, and reduced ACSL4-dependent lipid peroxidation. The experimental design is rigorous and multifaceted, with consistent results across cell types and size manipulation methods. While the study is limited to in vitro systems, its conceptual and mechanistic insights lay the groundwork for future in vivo and translational investigations.

      Advance: This work is the first to systematically show that cell size directly influences ferroptosis susceptibility via proteome scaling. It reconciles previous findings that large cells are sensitized to GPX4 inhibition (RSL3) by demonstrating that the ferroptosis pathway targeted (system xc<sup>-</sup> vs GPX4) determines the direction of size-dependent vulnerability. The study provides a conceptual advance by positioning cell size as a regulatory axis in cell death decisions, and a mechanistic advance by identifying size-dependent changes in glutathione metabolism, ferritin levels, and ACSL4 expression.

      Audience: This research will be of interest to specialists in cell death, ferroptosis, redox biology, and cancer biology. It also holds relevance for aging researchers and translational scientists exploring ferroptosis-based therapies. The findings may influence how cell size heterogeneity is considered in therapeutic design, particularly in oncology and senescence-targeting strategies.

      Field of Expertise: Translational cancer biology, cell cycle regulation, proteomics, therapy resistance, molecular mechanisms of cell death.

      We thank Reviewer #2 for their careful and constructive assessment of our manuscript. We were happy that they appreciated the rigor of our multifaceted approach. We are also grateful for their thoughtful perspective on the conceptual and mechanistic advances, and for highlighting the broader relevance of this work to ferroptosis biology, redox regulation, cancer and aging research.

      Reviewer #3 (Evidence, reproducibility and clarity):

      In this manuscript, Zatulovskiy and colleagues elaborate on their previous work describing cell size-dependent changes in the proteome by investigating whether these changes can be correlated with differences in cell physiology. Using a cleverly-designed high throughput screen, they searched for compounds that differently-sized cells display differential sensitivity towards. Their primary hit, Era2, is involved in the ferroptosis pathway and serves as the starting point for a detailed study of how excess cell size protects cells from ferroptosis-induced cell death via: 1) lower concentrations of ACSL4 (which produces peroxidation-prone PUFAs), 2) increased ferritin concentrations, and 3) increased GSH concentrations.

      Overall, the experiments in this manuscript are well-designed and interpreted. It is an extremely well-written manuscript with a clear trajectory of logic. I have only a few major concerns that should be addressed before publication:

      We thank Reviewer #3 for their careful reading of the manuscript and for the clear summary of our study and its central findings. We appreciate their positive assessment of the experimental design, interpretation, and overall clarity of the writing and logical flow. We are also grateful for their constructive feedback and take their major concerns seriously; we have addressed each point in detail below.

      Major concerns:

      (1) In Figure 3E, the authors gate their flow cytometry data using SYTOX so that they are only analyzing live cells. Based on their gating scheme, it seems like there are really a lot of dead cells. Presumably the cells that died were the most sensitive to Era2, so it seems an oversight to discard these cells. Of course, it is not appropriate to analyze dead cells, but this could potentially be solved by using a shorter treatment duration than 24 hours wherein fewer cells die.

      This is a good point. To address it, we have now replaced this panel with a time point where most cells are still alive (20 h, 0.2 µM Era2), as suggested by the reviewer (Fig. 3E,F). This did not change the conclusion that BODIPY-C11 oxidation decreases with cell size.

      (2) In Figure 5, are the small, medium, and large bins for ACSL4 KO cells the same as for WT cells? If the ACSL4 KO cells are just bigger to begin with, this could explain why the "small" bin has greater cell survival than the WT small bin. Moreover, is the overlap between the three bins the same in the WT and KO cells?

      This is an important point. We have now added Supplementary Figure S4B to show the relative size of small, medium, and large WT and ACSL4 KO HMEC cells. As seen from this graph, the ACSL4 KO cells are not bigger than WT cells. Importantly, the fold-range between the small and large FACS-sorted cells is similar (~1.9 to 2-fold).

      (3) Loo, et al. Nat Comms 2025 similarly found that senescent cells (which are enlarged) are resistant to ferroptosis using the same inhibitor as the authors. In contrast to the authors, they show that this is due to lysosomal alkalinization and sequestration of ferrous iron in lysosomes. Given that Lanz et al. 2022 found that lysosomal components super-scale with cell size, it seems like this would be an important hypothesis to address. Free lysosomal iron can be easily measured with the LysoRhoNox stain. Loo et al. was able to restore ferroptosis sensitivity in senescent cells using the V-ATPase activator EN6, so it would be important for the authors to address whether this (or similar) treatment would have the same effect in enlarged cells.

      This is an excellent point. We have now performed this experiment and added it to the manuscript, as suggested by the reviewer. Based on the Lyso-FerroRed staining (another brand name for the LysoRhoNox probe), we do not see an increase in lysosomal iron sequestration in large cells (Fig. S2B):

      Line 301: “Previous work suggests a link between increased sequestration of ferrous iron in lysosomes and resistance to ferroptosis. It was reported that senescent cells, which are also large (Fig. S3A,B), gain resistance to ferroptosis through lysosomal alkalinization and sequestration of ferrous iron in lysosomes (Loo et al, 2025). We therefore tested whether the superscaling of lysosomes observed in large cells (Lanz et al, 2022; You et al, 2025) promotes Era2 resistance through lysosomal iron sequestration. To do this, we stained the cells with the lysosomal iron detection probe Lyso-FerroRed (Saimoto et al, 2025) and measured its scaling using flow cytometry (Fig. S2B). We observed that the amount of Lyso-FerroRed, and therefore, the amount of lysosomal iron, scaled in direct proportion to cell size, just like the total cellular protein content (Fig. S2B). These results indicate that iron chelation by ferritin and its sequestration in lysosomes are unlikely to play a crucial role in size-dependent decrease in Era2 sensitivity.”

      Minor concerns:

      (1) It would be helpful if this manuscript were re-submitted with line numbers to more easily reference the text.

      We have added line numbers for convenience.

      (2) In Figure 5A and other figures that reproduce data from Lanz et al. 2022, it would be helpful to have a summary curve for the overall abundance of each protein rather than only the individual peptide curves. These plots (particularly Figure 5A) are difficult to interpret since some peptides were presumably more abundant / measured with higher confidence than others.

      We have added the average ACSL4 protein slope line to Fig. 5A.

      (3) In Figure 5, the authors show the validation of the ACSL4 KO HT-1080 cell line but not HMEC, even though both are used in this figure. It would be useful to show both. Additionally, the authors switch back and forth between the two cell lines for this figure, and it is not clear why.

      We have added the HMEC ACSL4 KO validation Western blot in Fig. S4A.

      For the BODIPY oxidation experiment (Fig. 5D), we used HT-1080 instead of HMEC because HT-1080 cells are sensitive to lower concentrations of Era2, and therefore, we could better optimize the Era2 concentrations and treatment durations to measure BODIPY oxidation at the time point when most cells are still alive but demonstrate a pronounced oxidized BODIPY signal.

      (4) In Figure 5B, the authors use antibody-based staining of ACSL4 and flow cytometry to correlate a loss of ACSL4 expression with increased cell size, validating the proteomics data in Figure 5A. This does not seem like a good way to do this. Firstly, fixing cells with formaldehyde alters their size (is this proportional across differently sized cells? It's impossible to know), which makes it inappropriate to use SSC as a proxy for size in this particular situation. Secondly, the normalization scheme here doesn't make sense. If actin was used as a reference protein, why was tubulin used to normalize ACSL4 abundance? Overall, this seems like a very round-about experiment that could have just been addressed by doing a simple western blot with the four size bins sorted from live cells (as it was in the proteomics). If the issue is that ACSL4 is not detectable by western in the HMEC cells, another solution would be plating the live, sorted bins on coverslips and measuring by IF (or using the HT-1080 cells).

      We prefer IF flow cytometry to Western blotting for protein scaling analysis because it is more quantitative and provides cell size and protein content information for each individual cell. While in principle, different-sized cells might change their size differently during fixation, the cells that were larger or smaller prior to the fixation remain larger or smaller after fixation as well.

      Therefore, the SSC measurement after fixation still provides reliable information on size ranking, even if SSC does not perfectly linearly scale with cell volume. We do not use the SSC information to calculate protein concentrations here. Instead, we divide the amount of our protein of interest in the cell by the amount of constitutively-expressed Tubulin, which acts as an analogue of a loading control in this experiment. In Fig. 5B, both ACSL4 and Actin were normalized to Tubulin to estimate their concentrations. Actin is used just as a reference protein to show how the concentration of a perfectly scaling protein remains constant across cell size, as opposed to the sub-scaling ACSL4. Tubulin in this case was used as a proxy for total cellular protein content, which scales linearly in proportion to cell volume. This approach for determining the scaling behaviors of different proteins was previously validated in Lanz et al., Mol Cell 2022.
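      The normalization logic described above can be illustrated with a minimal sketch on synthetic data. Everything in it is an assumption made for illustration, not the authors' analysis code or data: the scaling exponents, the noise levels, the four-bin scheme, and the names `simulate_cells` and `binned_ratio`. Cells are ranked by a non-linear size proxy (standing in for SSC), and each protein is divided by tubulin per cell, so only the size ranking is taken from SSC.

```python
import random

def simulate_cells(n_cells=4000, seed=1):
    """Synthetic single-cell measurements (all values are hypothetical)."""
    rng = random.Random(seed)
    cells = []
    for _ in range(n_cells):
        size = rng.uniform(1.0, 3.0)  # arbitrary volume units
        cells.append({
            # SSC: monotonic but non-linear proxy for size, with noise
            "ssc": size ** 0.8 * rng.lognormvariate(0.0, 0.1),
            # Tubulin and actin scale in proportion to size
            "tubulin": size * rng.lognormvariate(0.0, 0.1),
            "actin": size * rng.lognormvariate(0.0, 0.1),
            # Assumed sub-scaling protein (exponent < 1)
            "acsl4": size ** 0.6 * rng.lognormvariate(0.0, 0.1),
        })
    return cells

def binned_ratio(cells, protein, n_bins=4):
    """Mean protein/tubulin ratio in bins of cells ranked by SSC."""
    ranked = sorted(cells, key=lambda c: c["ssc"])
    bin_size = len(ranked) // n_bins
    means = []
    for b in range(n_bins):
        chunk = ranked[b * bin_size:(b + 1) * bin_size]
        means.append(sum(c[protein] / c["tubulin"] for c in chunk) / len(chunk))
    return means

cells = simulate_cells()
actin_by_bin = binned_ratio(cells, "actin")
acsl4_by_bin = binned_ratio(cells, "acsl4")
print("actin/tubulin by size bin:", [round(v, 3) for v in actin_by_bin])
print("ACSL4/tubulin by size bin:", [round(v, 3) for v in acsl4_by_bin])
```

      In this sketch the ratio for the proportionally scaling protein stays roughly flat across bins, while the ratio for the sub-scaling protein falls from the smallest to the largest bin, even though SSC is not linear in size.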

      (5) In Figure 5E/5F, the authors pre-arrest the cells in G1 with palbociclib before size-sorting them. The pre-arrest is not done in other experiments using this cell line for size-sorting, so it would be important for the authors to comment on why this was done for this experiment but not others.

      As we found in Fig. 2B-E, the cell cycle has confounding effects on size-dependent ferroptosis susceptibility measurements (as discussed in detail in our response to the first major point of Reviewer #1 above). Briefly, to avoid these confounding effects and isolate the effects of cell size from the effects of the cell cycle, we pre-synchronized the cells with 24 h treatment with palbociclib in Fig. 5E,F. This is now better clarified in the text, as follows:

      Line 456: “In this experiment, we synchronized cells in G1 phase using palbociclib prior to cell sorting and also incubated the sorted cells in the presence of palbociclib during Era2 treatment to isolate cell size effects from the previously observed confounding effects of the cell cycle on ferroptosis (Fig. 2B,E).”

      (6) Conceptually, it is difficult for me to understand why large cell size sensitizes cells to GPX4 inhibition but confers resistance to Era2 treatment. Particularly given the pathway described in Figure 3A, I am having trouble understanding why these would convey such opposing phenotypes. Shouldn't the extra ferritin in the bigger cells also help them cope with GPX4 inhibition if, as the authors state in the discussion, the increased sensitivity to the GPX4 inhibitor is reported to be mediated by (among other things) iron accumulation? A deeper discussion of this seeming-incongruity would be helpful for contextualizing the broader role of cell size in determining ferroptosis sensitivity.

      We agree this is an important point, which was also raised by the other reviewers. As such, we note that context-dependent (i.e., cell type-specific) effects are common in the ferroptosis field, and multiple groups including our own (Dixon) have published extensively on genes and mechanisms that can lead to differences between erastin2 and RSL3. For example, there are studies showing that the mTOR pathway or the p53 pathway can both prevent and promote ferroptosis, depending on the cell type or some other hidden variable.

      To better address the differences between Era2 and RSL3 in the context of the cell-size-dependent response, we have now added more data and discussion. In the Results section we added panel 4B and the following text:

      Line 359: “While the upregulation of GSH biosynthesis may promote the resistance of larger cells to ferroptosis, such an upregulation alone cannot explain why larger cells become more resistant to ferroptosis induced by the cystine import inhibitor Era2, but not, for example, by the GPX4 inhibitor RSL3 (Chan et al, 2025) (Figs. 2B, S1B). We found previously that upon mTORC1 inhibition cells can evade cystine deprivation-induced ferroptosis by uptake and catabolism of cysteine-rich extracellular proteins, mostly albumin (Armenta et al, 2022) (Fig. S3C). This process involves albumin degradation in lysosomes, predominantly by cathepsin B (CatB), and subsequent export of cystine from lysosomes to fuel the synthesis of glutathione. Large cells undergo proteome rearrangements similar to those occurring upon mTORC1 inhibition (Zatulovskiy et al, 2022). This suggests that large cells may upregulate CatB expression to bypass the Era2-induced cystine import inhibition via system xc-. To test this hypothesis, we used flow cytometry to measure how the expression of cathepsin B and the system xc- cystine/glutamate transporter SLC7A11 (xCT) scales with cell size (Fig. 4B). We found that SLC7A11 concentration modestly decreases, while CatB concentration significantly increases with cell size (Fig. 4B). This shift in the ratio between SLC7A11 and CatB supports the hypothesis that larger cells may rely less on cystine import via system xc- and thus become more resistant to system xc- inhibition by Era2.”

      Additionally, in the Discussion we added the following:

      Line 578: “We show that large cells may become resistant specifically to Era2 but not RSL3 through the upregulation of lysosomal function, particularly cathepsin B expression, which enables the uptake and catabolism of cysteine-rich extracellular proteins. A size-dependent shift in the ratio between SLC7A11 and cathepsin B makes large cells less dependent on cystine import via system xc-, and thus, more resistant to Era2. In addition to this, it was reported that RSL3 can induce ferroptosis independently of GPX4 and may target other selenoproteins (DeAngelo et al, 2025; Cheff et al, 2023), which could also contribute to the difference in size-dependent responses to RSL3 and Era2.”

    1. eLife Assessment

      This study thoroughly assesses tactile acuity on women's breasts, for which no dependable data currently exists. The study provides two important contributions, by convincingly showing that tactile acuity on the breast is poor in comparison to other body parts, and that acuity is worst in larger breasts, indicating that the number of tactile sensors is fixed. This study will be of interest to the broader community of touch, as well as those interested in breast reconstruction and sexual function.

    2. Reviewer #1 (Public review):

      [Editors' note: this version has been assessed by the Senior Editor without further input from the original reviewers. The authors have moderated their claims and discussed the limitations of their experimental design more transparently. The previous reviews are included for reference.]

      Comments on previous version:

      The authors investigated tactile spatial perception on the breast using discrimination, categorization, and direct localization tasks. They reach four main conclusions:

      (1) The breast has poor tactile spatial resolution.

      This conclusion is based on comparing just noticeable differences, a marker of tactile spatial resolution, across four body regions, two on the breast. The data compellingly support the conclusion; the study outshines other studies on tactile spatial resolution that tend to use problematic measures of tactile resolution, such as two-point-discrimination thresholds. The result will interest researchers in the field and possibly in other fields due to the intriguing tension between the finding and the sexually arousing function of touching the breast.

      The manuscript incorrectly describes the result as poor spatial acuity. Acuity measures the average absolute error, and acuity is good when response biases are absent. Precision relates to the error variance. It is common to see high precision with low acuity or vice versa. Just noticeable differences assess precision or spatial resolution, while points of subjective equality evaluate acuity or bias. Similar confusions between these terms appear throughout the manuscript.

      A paragraph within the next section seems to follow up on this insight by examining the across-participant consistency of the differences in tactile spatial resolution between body parts. To this aim, pairwise rank correlations between body sites are conducted. This analysis raises red flags from a statistical point of view. 1) An ANOVA and its follow-up tests assume no variation in the size of the tested effect but varying base values across participants. Thus, if significant differences between conditions are confirmed by the original statistical analysis, most participants will have better spatial resolution in one condition than the other condition, and the difference between body sites will be similar across participants. 2) Correlations are power-hungry, and non-parametric tests are power-hungry. Thus, the number of participants needed for a reliable rank correlation analysis far exceeds that of the study. In sum, a correlation should emerge between body sites associated with significantly different tactile JNDs; however, these correlations might only be significant for body sites with pronounced differences due to the sample size.

      (2) Larger breasts are associated with lower tactile spatial resolution

      This conclusion is based on a strong correlation between participants' JNDs and the size of their breasts. The depicted correlation convincingly supports the conclusion. The sample size is below that recommended for correlations based on power analyses, but simulations show that spurious correlations of the reported size are extremely unlikely at N=18. Moreover, visual inspection rules out that outliers drive these correlations. Thus, they are convincing. This result is of interest to the field, as it aligns with the hypothesis that nerve fibers are more sparsely distributed across larger body parts.

      (3) The nipple is a unit

      The data do not support this conclusion. The conclusion that the nipple is perceived as a unit is based on poor tactile localization performance for touches on the nipple compared to the areola. The problem is that the localization task is a quadrant identification task with the center being at the nipple. Quadrants for the areola could be significantly larger due to the relative size of the areola and the nipple; the results section seems to suggest this was accounted for when placing the tactile stimuli within the quadrants, but the methods section suggests otherwise. Additionally, the areola has an advantage because of its distance from the nipple, which leads to larger Euclidean distances between the centers of the quadrants than for the nipple. Thus, participants should do better for the areola than for the nipple even if both sites have the same tactile resolution.

      To justify the conclusion that the nipple is a unit, additional data would be required. 1) One could compare psychometric curves with the nipple as the center and psychometric curves with a nearby point on the areola as the center. 2) Performance in the quadrant task could be compared for the nipple and an equally sized portion of the areola and tactile locations that have the same distance to the border between quadrants in skin coordinates. 3) Tactile resolution could be directly measured for both body sites using a tactile orientation task with either a two-dot probe or a haptic grating.

      Categorization accuracy in each area was tested against chance using a Monte Carlo test, which is fine, though the calculation of the test statistic, Z, should be reported in the Methods section, as there are several options. Localization accuracies are then compared between areas using a paired t-test. It is a bit confusing that once a distribution-approximating test is used, and once a test that assumes Gaussian distributions when the data is Bernoulli/Binomial distributed. Sampling-based and t-tests are very robust, so these surprising choices should have hardly any effect on the results.

      A correlation based on N=4 participants is dangerously underpowered. A quick simulation shows that correlation coefficients of randomly sampled numbers are uniformly distributed at such a low sample size. This likely spurious correlation is not analyzed, but quite prominently featured in a figure and discussed in the text, which is worrisome.
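      The sample-size concern can be checked with a short simulation. The sketch below is an illustration, not the reviewer's own simulation: it assumes independent standard-normal samples, Pearson correlation, and an arbitrary cutoff of |r| >= 0.8 (the function names and the cutoff are assumptions, not values from the manuscript). Under these assumptions the null distribution of r at N=4 is uniform on [-1, 1], so large coefficients arise by chance in roughly one of five draws, whereas at N=18 they are rare.

```python
import random

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def null_correlations(n, n_sims=20000, seed=0):
    """Correlations between pairs of independent Gaussian samples of size n."""
    rng = random.Random(seed)
    rs = []
    for _ in range(n_sims):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        y = [rng.gauss(0.0, 1.0) for _ in range(n)]
        rs.append(pearson_r(x, y))
    return rs

def frac_abs_at_least(rs, threshold):
    """Fraction of simulated coefficients with |r| >= threshold."""
    return sum(abs(r) >= threshold for r in rs) / len(rs)

# Illustrative cutoff; not the correlation reported in the manuscript.
CUTOFF = 0.8
frac_n4 = frac_abs_at_least(null_correlations(4), CUTOFF)
frac_n18 = frac_abs_at_least(null_correlations(18), CUTOFF)
print("N=4: ", frac_n4)
print("N=18:", frac_n18)
```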

      (4) Localization of tactile events on the breast is biased towards the nipple

      The conclusion that tactile percepts are drawn toward the nipple is based on localization biases for tactile stimuli on the breast compared to the back. Unfortunately, the way participants reported the tactile locations introduces a major confound. Participants indicated the perceived locations of the tactile stimulus on 3D models of these body parts. The nipple is a highly distinctive and cognitively represented landmark, far more so than the scapula, making it very likely that responses were biased toward the nipple regardless of the actual percepts. One imperfect but better alternative would have been to ask participants to identify locations on a neutral grey patch and help them relate this patch to their skin by repeatedly tracing its outline on the skin.

      Participants also saw their localization responses for the previously touched locations. This is unlikely to induce bias towards the nipple, but it renders any estimate of the size and variance of the errors unreliable. Participants will always make sure that the marked locations are sufficiently distant from each other.

      The statistical analysis is again a homebrew solution and hard to follow. It remains unclear why standard and straightforward measures of bias, such as regressing reported against actual locations, were not used.
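      As a minimal sketch of the regression approach mentioned above, one can fit reported positions against actual positions along one axis, with the landmark placed at the origin. The data below are synthetic and every number is an assumption chosen for illustration (a true slope of 0.7 and Gaussian response noise); `fit_line` is a hypothetical helper, not code from the study. A slope below 1 indicates that responses are compressed toward the origin, and a non-zero intercept indicates a constant shift.

```python
import random

def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

rng = random.Random(0)
# Actual stimulus positions relative to the landmark (arbitrary units, synthetic).
actual = [rng.uniform(-5.0, 5.0) for _ in range(200)]
# Synthetic responses: compressed toward the landmark (assumed slope 0.7) plus noise.
reported = [0.7 * a + rng.gauss(0.0, 0.5) for a in actual]

slope, intercept = fit_line(actual, reported)
print("slope:", round(slope, 3), "intercept:", round(intercept, 3))
```

      The same fit applied separately to each body site would give directly comparable slopes, although it would not by itself separate perceptual bias from response bias toward a salient landmark.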

      Null-hypothesis significance testing only lets scientists either reject the null hypothesis or not. The latter does NOT mean the Null hypothesis is true, i.e., it can never be concluded that there is no effect. This rule applies to every NHST test. However, it raises particular concerns with distribution tests. The only conclusion possible is that the data are unlikely from a population with the tested distribution; these tests do not provide insight into the actual distribution of the data, regardless of whether the result is significant or not.

    3. Reviewer #2 (Public review):

      Summary:

      The authors tested tactile acuity on the breast of females using several tasks.

      Results:

      Tactile acuity, assessed by just-noticeable differences in judging whether a touch was above or below a comparison stimulus, was lower on both the lateral and medial breast than on the hand and back. Acuity also scaled inversely with breast size, echoing earlier findings that larger hands exhibit lower acuity, presumably because a similar number of tactile receptors must be distributed over larger or smaller body surfaces. Observing this principle in the breast as on the hand strengthens the view that fixed innervation is a general organizing principle of the tactile system. Both methodology and analysis appear sound.

      Most participants were unable to localize touch to a specific quadrant of the nipple, suggesting it is perceived as a single tactile unit. However, the study does not address whether touches to the nipple and areola are confused; conceptualizing the nipple as a perceptual (landmark) unit would suggest that such confusion should not take place. Aside from this limitation, the methodology and analysis appear sound.

      Absolute touch localization, assessed by asking participants to indicate locations on a 3D rendering of their own torso, revealed a bias toward the nipple. The authors interpret this as evidence that the nipple serves as a landmark attracting perceived touch. However, as reviewers noted during review, alternative explanations cannot be fully ruled out: because the stimulus array was centered on the nipple, the observed bias may stem from stimulus distribution rather than landmark status. Aside from this caveat, the methodology and analysis appear sound.

      Overall assessment:

      The study offers a welcome exception to the prevailing bias in tactile research that limits investigation to the hand and arm. Its support for the fixed innervation hypothesis and its suggestion that the nipple may serve as a potential landmark, though requiring further scrutiny, illustrate the value of extending research to other body regions. By employing multiple tasks, the authors address several key aspects of tactile perception and create links to earlier findings.

    4. Author response:

      The following is the authors’ response to the previous reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      The manuscript incorrectly describes the result as poor spatial acuity. Acuity measures the average absolute error, and acuity is good when response biases are absent. Precision relates to the error variance. It is common to see high precision with low acuity or vice versa. Just noticeable differences assess precision or spatial resolution, while points of subjective equality evaluate acuity or bias. Similar confusions between these terms appear throughout the manuscript.

      While I do not agree with the reviewer's usage of the word “acuity” and a cursory Google search does not agree with the provided definition, I have replaced acuity with precision as appropriate to improve clarity.

      A paragraph within the next section seems to follow up on this insight by examining the across-participant consistency of the differences in tactile spatial resolution between body parts. To this aim, pairwise rank correlations between body sites are conducted. This analysis raises red flags from a statistical point of view. 1) An ANOVA and its follow-up tests assume no variation in the size of the tested effect but varying base values across participants. Thus, if significant differences between conditions are confirmed by the original statistical analysis, most participants will have better spatial resolution in one condition than the other condition, and the difference between body sites will be similar across participants. 2) Correlations are power-hungry, and non-parametric tests are power-hungry. Thus, the number of participants needed for a reliable rank correlation analysis far exceeds that of the study. In sum, a correlation should emerge between body sites associated with significantly different tactile JNDs; however, these correlations might only be significant for body sites with pronounced differences due to the sample size.

      We have entirely removed this result from both the text and supplement.

      The data do not support this conclusion. The conclusion that the nipple is perceived as a unit is based on poor tactile localization performance for touches on the nipple compared to the areola. The problem is that the localization task is a quadrant identification task with the center being at the nipple. Quadrants for the areola could be significantly larger due to the relative size of the areola and the nipple; the results section seems to suggest this was accounted for when placing the tactile stimuli within the quadrants, but the methods section suggests otherwise. Additionally, the areola has an advantage because of its distance from the nipple, which leads to larger Euclidean distances between the centers of the quadrants than for the nipple. Thus, participants should do better for the areola than for the nipple even if both sites have the same tactile resolution.

      We agree with this interpretation and have updated the language throughout.

      Categorization accuracy in each area was tested against chance using a Monte Carlo test, which is fine, though the calculation of the test statistic, Z, should be reported in the Methods section, as there are several options. Localization accuracies are then compared between areas using a paired t-test. It is a bit confusing that once a distribution-approximating test is used, and once a test that assumes Gaussian distributions when the data is Bernoulli/Binomial distributed. Sampling-based and t-tests are very robust, so these surprising choices should have hardly any effect on the results.

      Excellent point. We have replaced the paired t-test with a signed rank test and added text to the methods to expand upon this.
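      For readers unfamiliar with the test, a paired signed rank comparison can be sketched as follows. This is not the authors' analysis code: the per-participant counts are invented for illustration, `signed_rank_exact` is a hypothetical helper, and the exact two-sided p-value is obtained by enumerating all sign assignments, which is only practical for small samples.

```python
from itertools import product

def signed_rank_exact(x, y):
    """Exact two-sided Wilcoxon signed rank test for paired samples.

    Returns (W_plus, p_value). Zero differences are dropped and tied
    absolute differences receive average ranks.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    expected = sum(ranks) / 2
    observed_dev = abs(w_plus - expected)
    # Enumerate every assignment of signs to the ranks under the null hypothesis.
    extreme = 0
    total = 0
    for signs in product((0, 1), repeat=len(ranks)):
        w = sum(r for r, s in zip(ranks, signs) if s)
        total += 1
        if abs(w - expected) >= observed_dev - 1e-12:
            extreme += 1
    return w_plus, extreme / total

# Hypothetical counts of correct responses per participant (not real data).
areola = [18, 20, 15, 19, 22, 14, 17, 21]
nipple = [10, 12, 9, 11, 13, 10, 8, 12]
w_plus, p_value = signed_rank_exact(areola, nipple)
print("W+ =", w_plus, "p =", p_value)
```

      Because the test uses only the signs and ranks of the paired differences, it avoids the Gaussian assumption that the reviewer questioned for bounded accuracy data.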

      A correlation based on N=4 participants is dangerously underpowered. A quick simulation shows that correlation coefficients of randomly sampled numbers are uniformly distributed at such a low sample size. This likely spurious correlation is not analyzed, but quite prominently featured in a figure and discussed in the text, which is worrisome.

      We have removed this panel to reduce this concern.

      The conclusion that tactile percepts are drawn toward the nipple is based on localization biases for tactile stimuli on the breast compared to the back. Unfortunately, the way participants reported the tactile locations introduces a major confound. Participants indicated the perceived locations of the tactile stimulus on 3D models of these body parts. The nipple is a highly distinctive and cognitively represented landmark, far more so than the scapula, making it very likely that responses were biased toward the nipple regardless of the actual percepts. One imperfect but better alternative would have been to ask participants to identify locations on a neutral grey patch and help them relate this patch to their skin by repeatedly tracing its outline on the skin.

      While I wholeheartedly agree with the sentiments of the reviewer, in our experience performing these tests across many women we have found that the variability of the morphology of the breast makes it incredibly hard for women to perform this task in the way the reviewer is describing. Consequently, there is likely no perfect version of the task. That said, we have endeavored to acknowledge the limitations of the approach in the discussion.

      Participants also saw their localization responses for the previously touched locations. This is unlikely to induce bias towards the nipple, but it renders any estimate of the size and variance of the errors unreliable. Participants will always make sure that the marked locations are sufficiently distant from each other.

      I again respectfully disagree with this interpretation. If the participants were to always make sure marked locations were sufficiently distant from each other then the degree of error and bias would be similar between regions given that the visual pattern would be almost identical. As this is not true in the data, I disagree with the premise, though we hope the changes to the discussion acknowledge limitations with the data collection method.

      Null-hypothesis significance testing only lets scientists either reject the null hypothesis or not. The latter does NOT mean the Null hypothesis is true, i.e., it can never be concluded that there is no effect. This rule applies to every NHST test. However, it raises particular concerns with distribution tests. The only conclusion possible is that the data are unlikely from a population with the tested distribution; these tests do not provide insight into the actual distribution of the data, regardless of whether the result is significant or not.

      Thank you for this comment. We have updated the language to make it explicit that failing to deviate from the Null distribution does not mean that the data are in fact Null in nature.

      Recommendations for the authors:

      Reviewer #2 (Recommendations for the authors):

      I am wondering whether the interpretation of "the nipple as a sensory unit" is also supported by localization performance as reported in the analysis around Fig. 3 and supplementary Fig. 2. I cannot really see the error lines in that figure, and cannot tell whether any of the touches were on the nipple proper. Specifically I am wondering whether touch to the nipple is reliably attributed to the nipple, and touch to the areola to the areola, or whether confusion exists between the two. The description of the nipple as a sensory unit implies reliable attribution of touch to the respective area. Also the discussion (lines 309ff) is ambiguous about this.

      Thank you for this comment. We have removed language about the nipple being a unit and reframed the text in the discussion. We have also clarified that touches were indeed on the nipple.

      typos etc.

      lines 68-71 - implied causality is not backed up by evidence and could be the other way around than stated here

      line 82 grammar is inconsistent

      lines 199-200, "on the nipple" occurs twice

      Thank you for catching these. We have addressed the typos and grammar. We have also added a citation to the sentence where this exact hypothesis is stated. We have also relaxed the language to imply it is indeed a hypothesis.

    1. eLife Assessment

      This important work demonstrates the role of physically linking the core and CTD kinase modules of TFIIH via separate domains of subunit Tfb3 in confining RNA Polymerase II Serine 5 CTD phosphorylation to promoter regions of transcribed genes in budding yeast. The main findings, resulting from analyses of viable Tfb3 mutants in which the linkage between TFIIH core and kinase modules has been severed, are supported by solid evidence from in vitro and in vivo experiments. The new findings raise the intriguing possibility that the Tfb3-mediated connection between core and kinase modules of TFIIH is an evolutionary addition to an ancestral state of physically unconnected enzymes.

    2. Reviewer #1 (Public review):

      [Editors' note: this version has been assessed by the Reviewing Editor without further input from the original reviewers. The authors have addressed the comments raised in the previous rounds of review.]

      Giordano et al. demonstrate that yeast cells expressing separated N- and C-terminal regions of Tfb3 are viable and grow well. Using this creative and powerful tool, the authors effectively uncouple CTD Ser5 phosphorylation at promoters and assess its impact on transcription. This strategy is complementary to previous approaches, such as Kin28 depletion or the use of CDK7 inhibitors. The results are largely consistent with earlier studies, reinforcing the importance of the Tfb3 linkage in mediating CTD Ser5 phosphorylation at promoters and subsequent transcription.

      Notably, the authors also observe effects attributable to the Tfb3 linker itself, beyond its role as a simple physical connection between the N- and C-terminal domains. These findings provide functional insight into the Tfb3 linker, which had previously been observed in structural studies but lacked clear functional relevance. Overall, I am very positive about the publication of this manuscript.

    3. Reviewer #2 (Public review):

      Summary:

      This work advances our understanding of how TFIIH coordinates DNA melting and CTD phosphorylation during transcription initiation. The finding that untethered kinase activity becomes "unfocused," phosphorylating the CTD at Ser5 throughout the coding sequence rather than being promoter-restricted, suggests that the TFIIH Core-Kinase linkage not only targets the kinase to promoters but also constrains its activity in a spatial and temporal manner.

      Strengths:

      The experiments presented are straightforward, and the models for coupling initiation and CTD phosphorylation and for the evolution of these linked processes are interesting and novel. The results have important implications for the regulation of initiation and CTD phosphorylation.

    4. Reviewer #3 (Public review):

      Summary:

      Eukaryotic gene transcription requires a large assemblage of protein complexes that govern the molecular events required for RNA Polymerase II to produce mRNAs. One of these complexes, TFIIH, comprises two modules, one of which promotes DNA unwinding at promoters, while the other contains a kinase (Kin28 in yeast) that phosphorylates the repeated motif at the C-terminal domain (CTD) of the largest subunit of Pol II. Kin28 phosphorylation of Ser5 in the YSPTSPS motif of the CTD is normally highly localized at promoter regions, and marks the beginning of a cycle of phosphorylation events and accompanying protein association with the CTD during the transition from initiation to elongation.

      The two modules of TFIIH are linked by Tfb3. Tfb3 consists of two globular regions, an N-terminal domain that contacts the Core module of TFIIH and a C-terminal domain that contacts the kinase module, connected by a linker. In this paper, Giordano et al. test the role of Tfb3 as a connector between the two modules of TFIIH in yeast. They show that while no or very slow growth occurs if only the C-terminal or N-terminal region of Tfb3 is present, near normal growth is observed when the two unlinked regions are expressed. Consistent with this result, the separate domains are shown to interact with the two distinct TFIIH modules. ChIP experiments show that the Core module of TFIIH maintains its localization at gene promoters when the Tfb3 domains are separated, while localization of the kinase module, and of Ser5 phosphorylation on the CTD of Pol II, is disrupted. Finally, the authors examine the effect of separating the Tfb3 domains on another function of TFIIH, namely nucleotide excision repair, and find little or no effect when only the N-terminal region of Tfb3 or the two unlinked domains are present.

      Strengths:

      Experiments involving expression of Tfb3 domains in yeast are well-controlled and the data regarding viability, interaction of the separate Tfb3 domains with TFIIH modules, genome-wide localization of the TFIIH modules and of phosphorylated Ser5 CTDs, and of effects on NER, are convincing. The experiments are consistent with current models of TFIIH structure and function and support a model in which Tfb3 tethers the kinase module of TFIIH close to initiation sites to prevent its promiscuous action on elongating Pol II.

    5. Author response:

      The following is the authors’ response to the previous reviews

      eLife Assessment

      This important work demonstrates the role of physically linking the core and CTD kinase modules of TFIIH via separate domains of subunit Tfb3 in confining RNA Polymerase II Serine 5 CTD phosphorylation to promoter regions of transcribed genes in budding yeast. The main findings, resulting from analyses of viable Tfb3 mutants in which the linkage between TFIIH core and kinase modules has been severed, are supported by solid evidence from in vitro and in vivo experiments. The new findings raise the intriguing possibility that the Tfb3-mediated connection between core and kinase modules of TFIIH is an evolutionary addition to an ancestral state of physically unconnected enzymes.

      After consultation with the referees, we would like to suggest that you insert text into the RESULTS section acknowledging two limitations of your findings remaining in the revised manuscript, as follows:

      (i) It remains possible that Kin28 abundance was reduced by splitting Tfb3, which could be a factor in reducing its occupancies at gene promoters.

      In response, the paper now contains the following sentence:

      “Kin28 levels in extracts were below the limit of detection for our antibody, so we cannot rule out that the drop in ChIP signal is partly due to reduced Kin28 levels in the split Tfb3 strains. However, the viability of the cells (Figure 2) and the Tfb3-TAP purifications (Figure 3) argue against a complete loss of Kin28.”

      (ii) Lower than wild-type expression of the Tfb3 truncations might contribute to their mutant phenotypes shown in Figs. 2 & 5.

      In response, the paper now contains the following sentence:

      “There was some variation in protein expression levels (Figure 3A, left panel, lanes 1-4), and reduced levels of the split Tfb3 may contribute to the slow growth phenotypes.”

      Public Reviews:

      Reviewer #1 (Public review):

      Giordano et al. demonstrate that yeast cells expressing separated N- and C-terminal regions of Tfb3 are viable and grow well. Using this creative and powerful tool, the authors effectively uncouple CTD Ser5 phosphorylation at promoters and assess its impact on transcription. This strategy is complementary to previous approaches, such as Kin28 depletion or the use of CDK7 inhibitors. The results are largely consistent with earlier studies, reinforcing the importance of the Tfb3 linkage in mediating CTD Ser5 phosphorylation at promoters and subsequent transcription.

      Notably, the authors also observe effects attributable to the Tfb3 linker itself, beyond its role as a simple physical connection between the N- and C-terminal domains. These findings provide functional insight into the Tfb3 linker, which had previously been observed in structural studies but lacked clear functional relevance. Overall, I am very positive about the publication of this manuscript and offer a few minor comments below that may help to further strengthen the study.

      We appreciate the reviewer’s positive assessment of our work and suggestions for improvement.

      Page 4 PIC structures show the linker emerging from the N-terminal domain as a long alpha-helix running along the interface between the two ATPase subunits, followed by a turn and a short stretch of helix just N-terminal to a disordered region that connects to the C-terminal region (see schematic in Fig. 1A).

      The linker helix was only observed in the poised PIC (Abril-Garrido et al., 2023), not other fully-engaged PIC structures.

      Thanks for clarifying. We note that some structures of TFIIH alone also see the long helix. Accordingly, we modified this section to read:

      “In many TFIIH and PIC structures the linker is not visible, presumably due to flexibility. However, when it is seen (Abril-Garrido et al., 2023; Greber et al., 2019), the linker emerges from the N-terminal domain as a long alpha-helix running along the interface between the two ATPase subunits…”

      Page 8 Recent structures (reviewed in (Yu et al., 2023)) show that the Kinase Module would block interactions between the Core Module and other NER factors. Therefore, TFIIH either enters into the NER complex as free Core Module, or the Kinase Module must dissociate soon after.

      To my knowledge, this is still controversial in the NER field. I note that the potential function of the kinase module is likely attributable to the N-terminal region of Tfb3 through its binding to Rad3.

      We are not experts on NER, but in reviews of the field this appears to be a widely held assumption. A 2008 paper from the Egly lab (Coin et al., DOI 10.1016/j.molcel.2008.04.024) is usually cited, which shows that the interaction between XPD (metazoan Rad3) and XPA is likely incompatible with XPD-MAT1 interaction. In addition to the Yu 2023 review, we now also cite a more recent publication that more extensively reviews the models for core TFIIH interactions (van Sluis et al., 2025). We looked at the multiple recently published structures of various TC-NER and GG-NER intermediate complexes, and none of them show the CAK module or even the Tfb3/Mat1 N-term, even though those proteins were typically included during assembly. We also consulted with our colleagues Johannes Walter and Lucas Farnung, who are studying various TC-NER intermediates biochemically and structurally. Although the CAK module is included in their assembly reactions, it is not visible in their cryoEM structures. They tell us that the presence of CAK would be compatible with early TC-NER intermediates, but is predicted to overlap with later interactions of XPD with the TC-NER factor STK19 (see Mevissen et al., Cell 2024). To be conservative, we modified the sentence to say “Recent structures … suggest” rather than “show”.

      Because the yeast strains used in Fig. 6 retain the N-terminal region of Tfb3, the UV sensitivity assay presented here is unlikely to directly address the contribution of the kinase module to NER.

      We agree that our experiment only shows that the connection between Tfb3 N- and C-term domains is not necessary for NER. The individual domains might still be able to function independently. Accordingly, we changed the heading of that section from “Disconnected core TFIIH does not cause an NER defect” to “Split Tfb3 does not cause an NER defect.” This more closely matches the figure legend title.

      Page 11. Notably, release of the Tfb3 Linker contact also results in the long alpha-helix becoming disordered (Abril-Garrido et al., 2023), which could allow the kinase access to a far larger radius of area. This flexibility could help the kinase reach both proximal and distal repeats within the CTD, which can theoretically extend quite far from the RNApII body.

      Although the kinase module was resolved at low resolution in all PIC-Mediator structures, these structural studies consistently reveal the same overall positioning of the kinase module on Mediator, indicating that its localization is constrained rather than variable. This observation suggests that the linker region may help position the kinase module at this specific site, likely through direct interactions with the PIC or Mediator. This idea is further supported by numerous cross-links between the linker region and Mediator (Robinson et al., 2016).

      That is true. But please note that this sentence was meant to describe movement of the kinase module AFTER release from Mediator (see previous sentence). Re-reading the passage, we realized the confusion is because we propose multiple possible pathways in that paragraph. In the first half, we suggest the capture of the kinase module by Mediator might trigger the conformation changes in the linker. In the second half (where it says “Alternatively….”) we suggest the Mediator-CAK interaction could instead come first, and the release of this contact could free the CAK module to move around. We have modified the paragraph to make it clear these are two distinct models.

      Comments on revisions:

      Revised ms clarified all my points, including those I previously misunderstood.

      Thanks again for helping us improve the manuscript.

      Reviewer #2 (Public review):

      Summary:

      This work advances our understanding of how TFIIH coordinates DNA melting and CTD phosphorylation during transcription initiation. The finding that untethered kinase activity becomes "unfocused," phosphorylating the CTD at Ser5 throughout the coding sequence rather than being promoter-restricted, suggests that the TFIIH Core-Kinase linkage not only targets the kinase to promoters but also constrains its activity in a spatial and temporal manner.

      Strengths:

      The experiments presented are straightforward, and the models for coupling initiation and CTD phosphorylation and for the evolution of these linked processes are interesting and novel. The results have important implications for the regulation of initiation and CTD phosphorylation.

      Comments on revisions:

      The revised version with revisions to figures, text and new data has addressed all of our prior comments.

      We thank the reviewer for helping us improve the paper.

      Reviewer #3 (Public review):

      Summary:

      Eukaryotic gene transcription requires a large assemblage of protein complexes that govern the molecular events required for RNA Polymerase II to produce mRNAs. One of these complexes, TFIIH, comprises two modules, one of which promotes DNA unwinding at promoters, while the other contains a kinase (Kin28 in yeast) that phosphorylates the repeated motif at the C-terminal domain (CTD) of the largest subunit of Pol II. Kin28 phosphorylation of Ser5 in the YSPTSPS motif of the CTD is normally highly localized at promoter regions, and marks the beginning of a cycle of phosphorylation events and accompanying protein association with the CTD during the transition from initiation to elongation.

      The two modules of TFIIH are linked by Tfb3. Tfb3 consists of two globular regions, an N-terminal domain that contacts the Core module of TFIIH and a C-terminal domain that contacts the kinase module, connected by a linker. In this paper, Giordano et al. test the role of Tfb3 as a connector between the two modules of TFIIH in yeast. They show that while no or very slow growth occurs if only the C-terminal or N-terminal region of Tfb3 is present, near normal growth is observed when the two unlinked regions are expressed. Consistent with this result, the separate domains are shown to interact with the two distinct TFIIH modules. ChIP experiments show that the Core module of TFIIH maintains its localization at gene promoters when the Tfb3 domains are separated, while localization of the kinase module, and of Ser5 phosphorylation on the CTD of Pol II, is disrupted. Finally, the authors examine the effect of separating the Tfb3 domains on another function of TFIIH, namely nucleotide excision repair, and find little or no effect when only the N-terminal region of Tfb3 or the two unlinked domains are present.

      Strengths:

      Experiments involving expression of Tfb3 domains in yeast are well-controlled and the data regarding viability, interaction of the separate Tfb3 domains with TFIIH modules, genome-wide localization of the TFIIH modules and of phosphorylated Ser5 CTDs, and of effects on NER, are convincing. The experiments are consistent with current models of TFIIH structure and function and support a model in which Tfb3 tethers the kinase module of TFIIH close to initiation sites to prevent its promiscuous action on elongating Pol II.

      We appreciate that the reviewer finds that our main conclusions are convincing.

      Weaknesses:

      The work is limited in scope and does not provide major insights into the mechanism of transcription. The main addition to current models of transcription is that tethering of Kin28 to Tfb3 may limit kinase action from occurring downstream from the initiation site.

      The first described experiment, which purports to show that three kinases cannot function in place of Kin28 when tethered (by fusion) to Tfb3, is missing the crucial control of showing that Kin28 can support viability in the same context. This result also does not connect with the rest of the manuscript, although the experiment apparently motivated the subsequent studies reported here.

      We elected not to do this control experiment for several reasons. As reviewer 3 points out, this kinase fusion experiment turned out to be somewhat disconnected from the rest of the paper. Even though it didn’t work, we included it in the paper because the results led us to the realization that the Tfb3 C-term was actually not fully essential for viability as reported, which in turn led us to the idea of splitting Tfb3. Structural studies (https://doi.org/10.1126/sciadv.abd4420, https://doi.org/10.1073/pnas.2009627117, https://doi.org/10.7554/eLife.44771) show that, in addition to providing linkage to the core module, the C-term of Tfb3 induces a conformation change in Kin28/Cdk7 necessary for full kinase activity (which is likely why the strains without C-term are just barely viable). If we were to pursue why the fusions didn’t work, we could tether Kin28 directly to the Tfb3 linker (and may try this in the future), but then would need to also express the C-term separately for its activating function. Even then, this would be an imperfect control for the fusion experiments in Figure 1. Because we were trying to best mimic Kin28 being tethered via the accessory subunit Tfb3/Mat1, in the Figure 1 experiment we did not directly attach the kinases to Tfb3. For Ctk1/Cdk12, we fused the Tfb3 linker to the Ctk3 accessory subunit (analogous to Tfb3), and for Bur1/Cdk9, we fused to the cyclin subunit Bur2 (there is no known third subunit in this complex). The one exception was Mpk1, which has no partner subunits and is not a CDK. There are many reasons why this high-risk protein fusion experiment may not have worked, but we chose not to pursue it further at this time.

      Finally, the authors present the interesting and reasonable speculation that the TFIIH complex and connecting Tfb3 found in mammals and yeast may have evolved from an earlier state in which the two TFIIH subdomains were present as unconnected, distinct enzymes. It will be interesting to have this idea tested more thoroughly as more molecular evolutionary data becomes available.

      Comments on revisions:

      For the most part, the authors have satisfactorily addressed my previous critique. In particular, they have added to their discussion of evolutionary implications, and performed an experiment casting doubt on the assertion of a dominant negative effect, and as a consequence removed this claim from the manuscript. I also pointed out that the fusion experiments that lead off the Results section are missing the crucial control of including a Tfb3-Kin28 fusion. The authors have elected not to perform this control experiment, pointing out that even this control would be imperfect in some respects, and agreeing that this experiment is somewhat disconnected from the rest of the paper. The reason for including it, in spite of its somewhat tangential nature, is that it provides something of a rationale for the experiments that follow. I don't so much mind their retaining the experiment, as the absence of this control (and indeed, the results) does not so much impact the later results. However, I think if it is to be included, this shortcoming should be explicitly recognized, especially as a service to younger scientists who could benefit from an exposition that includes a thorough consideration of potential control experiments.

      We thank the reviewer for helping us improve the paper.

    1. eLife Assessment

      This manuscript reports a high-quality genome assembly of the European cuttlefish, Sepia officinalis, a representative species of the Cephalopod lineage. This solid work relies on current best practices in genome sequencing and assembly, combining PacBio HiFi long reads and Hi-C chromatin conformation capture, and on state-of-the-art comparative genomic analyses, including chromosome number evolution and analyses of expanded gene families. The resulting genome will be a valuable resource for researchers interested in cuttlefish biology and comparative genomics in general.

    2. Reviewer #1 (Public review):

      [Editors' note: this version has been assessed by the Reviewing Editor without further input from the original reviewers. The authors have carefully considered all the reviewers' comments. The newly added analyses, figures, and text sections are of high quality, and we commend the authors for their in-depth revision of the manuscript.]

      This manuscript presents a high-quality, chromosome-level genome assembly of the European cuttlefish (Sepia officinalis), a representative species of the cephalopod lineage. Using state-of-the-art sequencing and scaffolding technologies, including PacBio HiFi long reads and Hi-C chromatin conformation capture, the authors deliver a genome assembly with exceptional contiguity and completeness, as evidenced by high BUSCO scores. This genome resource fills a significant gap in cephalopod genomics and offers a valuable foundation for studies in neurobiology, behavior, and evolutionary biology. However, there are several major aspects that need to be strengthened.

    3. Reviewer #2 (Public review):

      This paper concerns an interesting organism, Sepia officinalis. However, in the opinion of this reviewer, the paper reads somewhat like a genome report. The authors have used 23x PacBio HiFi in conjunction with relatively low coverage (11x) Hi-C to scaffold the genome into a karyotype of 47 chromosomes. They have used a combination of short- and long-read RNA-seq to annotate the genome in what looks like a very good annotation. The paper offers basic analyses of the BUSCO evaluation, some descriptive analyses of gene family and repeat content, and a bit more focused analysis on synteny among sequenced squids. Generally, the data will be useful.

    4. Reviewer #3 (Public review):

      Summary:

      In this study, Simone Rencken and co-authors present and investigate the genome of the common cuttlefish Sepia officinalis.

      Strengths:

      The authors explain in a detailed yet concise manner the main steps for a genome assembly, with very robust methods for validation, and according to current best practices. In addition to the chromosomal assembly, the authors confirmed the presence of 47 chromosomes using Hi-C data and multiple species synteny. They also generated a comprehensive gene annotation, with assessments of gene completeness, providing a useful resource for the community of researchers interested in cuttlefish biology and comparative genomics.

    5. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer 1 (Public review):

      Summary:

      This manuscript presents a high-quality, chromosome-level genome assembly of the European cuttlefish (Sepia officinalis), a representative species of the cephalopod lineage. Using state-of-the-art sequencing and scaffolding technologies, including PacBio HiFi long reads and Hi-C chromatin conformation capture, the authors deliver a genome assembly with exceptional contiguity and completeness, as evidenced by high BUSCO scores. This genome resource fills a significant gap in cephalopod genomics and offers a valuable foundation for studies in neurobiology, behavior, and evolutionary biology. However, there are several major aspects that need to be strengthened.

      Major Revisions Recommended:

      (1) Single-individual genome limitation

      The genome assembly is based on a single individual, which appears to be male. While this approach is common in genome projects, it does not capture the full genetic diversity of the species. As S. officinalis exhibits a wide geographical range and possible population structure, future efforts (or discussion in this manuscript) should consider re-sequencing multiple individuals - of both sexes and from diverse geographic origins - to characterize population-level variation, sex-linked features, and structural polymorphisms.

      We thank the reviewer for this summary and the important point raised. While sequencing additional individuals, unfortunately, lies outside the scope of our study, we used the published data from the DToL assembly (from a male individual from a different geographical origin) to begin to investigate their differences.

      First, we attempted to create a mixed assembly from both datasets, as also suggested by Reviewer 2, to increase data coverage and genetic information. Even though the heterozygosity estimate is quite low (ca. 1%), the mixed assembly produced severely inflated and fragmented results, yielding an assembly ca. 3× larger than expected, with the top 46 contigs covering only ~5% of the total length, a sign of over-duplication and failed haplotype collapse.

      This result is not surprising when considering the assembly algorithms: most programs, including hifiasm used in this study, assume a single diploid individual (or a trio assembly including data from both parents), so using multiple individuals breaks this assumption. Assembly pipelines infer homozygous/heterozygous coverage cutoffs from the k-mer histogram. Mixing individuals raises apparent heterozygosity far above true diploid levels, turning the expected bimodal k-mer profile into a complex multimodal distribution. This misleads the phasing and purging steps in the assembly pipeline, causing over-expansion and fragmentation of the assembly.
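
      The effect on the k-mer profile can be sketched with a toy calculation. This is an illustration only, not the hifiasm algorithm: it assumes every haplotype is sequenced to the same depth, ignores sampling noise and repeats, and uses a hypothetical total coverage of 40x. The function name `expected_kmer_peaks` is invented for the example.

```python
from typing import List


def expected_kmer_peaks(n_haplotypes: int, total_coverage: float) -> List[float]:
    """Expected coverage of k-mers shared by 1..n haplotypes in a pooled read set.

    Assumes equal sequencing depth per haplotype (a simplification).
    """
    if n_haplotypes < 1:
        raise ValueError("n_haplotypes must be at least 1")
    per_haplotype = total_coverage / n_haplotypes
    return [per_haplotype * shared for shared in range(1, n_haplotypes + 1)]


# One diploid individual: a heterozygous peak and a homozygous peak.
single = expected_kmer_peaks(2, 40.0)
# Two diploid individuals pooled: four haplotypes, so up to four peaks.
pooled = expected_kmer_peaks(4, 40.0)

print("single individual:", single)
print("two individuals pooled:", pooled)
```

      With four haplotypes in the pool, a cutoff inferred for a two-peak profile no longer separates heterozygous from homozygous k-mers, which is consistent with the inflated assembly described above.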

      Second, we created separate assemblies from the raw data sets of MPIBR and DToL using the exact same pipeline and parameters to avoid the technical problem described above. These assemblies are directly comparable, and after aligning them, it is possible to build a pangenome graph that we believe would help to address the points raised by the reviewer. Pangenome graphs can represent cross-individual variation more accurately and improve read alignment in regions of high genomic variation, which can aid population-level analyses [1]. We agree on the importance of this work, yet collecting data from more individuals and constructing and analyzing a pangenome graph lie beyond the scope of this manuscript and should be part of future efforts by the cephalopod genomics field.

      (2) Limited experimental validation of chromosomal inferences

      The study reports chromosome-scale scaffolding using Hi-C data and proposes a revised karyotype for S. officinalis. However, these inferences would be significantly strengthened by orthogonal validation methods. In particular, fluorescence in situ hybridization (FISH) or karyotyping from cytogenetic preparations would provide direct confirmation of chromosome number and structural arrangements. The reliance solely on Hi-C contact maps for inferring chromosomal organization should be acknowledged as a limitation or supplemented with such validations.

      We appreciate the reviewer’s point regarding the value of orthogonal validation methods to support the chromosome-scale scaffolding and proposed karyotype. We acknowledge that relying solely on Hi-C contact maps to infer chromosome number and structure presents limitations, as also becomes apparent in our detailed analysis of both S. officinalis genome assemblies (in Figure 2 and Supplementary Figure 3 of the revised manuscript). We attempted to complement these analyses with cytogenetic approaches. Unfortunately, the availability of suitable mitotic tissue was limited. Moreover, our karyotyping trials proved challenging: resolving the ≥92 (2n) chromosomes in situ was not feasible due to their high number and the small size of the nuclei (approximately 5 µm in diameter on average).

      We now highlight this point as an important direction for future work in our discussion (lines 456-466):

      “Additional methods such as cytogenetic karyotyping or optical mapping such as BioNano [141] (imaging of fluorescently tagged, linearized DNA) could be used to validate chromosome numbers. However, whereas karyotypes of octopuses have been consistent throughout the literature (1n=30) [142,143], those measured in decapods vary greatly. For example, 1n=46 chromosomes have been reported for two species of cuttlefish (A. esculentum and A. lycidas) and three loliginid squids [85]; 1n=36 has been reported for A. arabica [86] and 1n=24 in A. pharaonis [87]. In S. officinalis, a karyotype of 1n=52 is reported for testis samples [88]. Combining cytogenetic preparations with fluorescent labeling of centromeric or telomeric sequences, as demonstrated in the octopus A. aerolatus [143], could help resolve these issues. Establishing a routine staining protocol would enable comprehensive tests at the species- and population-level.”

      (3) Shallow discussion of chromosomal evolution

      The manuscript briefly mentions chromosomal number differences among cephalopods but does not explore their evolutionary or functional implications. A more thorough comparative analysis - linking chromosomal rearrangements (e.g., fusions, fissions) with ecological adaptation, life history, or neural complexity - would greatly enhance the impact of the findings. Referencing chromosomal dynamics in related taxa and possible links to behavioral innovations would contextualize these results more effectively.

      We agree with the reviewer that this is a fascinating topic of research that demands further attention and have extended our discussion, which now reads (line 476-501):

      “In addition to studying chromosomal topology in phylogenetic reconstructions, some of the most interesting aspects of these rearrangements relate to changes of and innovation in regulatory elements that underlie phenotypic diversity. In coleoid cephalopods, it is thought that an ancient large-scale genome rearrangement was combined with lineage-specific changes and repeat expansions [48–50]. This restructuring gave rise to hundreds of tightly linked, evolutionarily unique microsyntenies, corresponding to distinct topological compartments with specialized regulatory architectures that contribute to complex, tissue-specific expression patterns in the nervous system and elsewhere [43]. Extending this, chromosomal conformation analyses in E. scolopes revealed that co-regulated eye and light-organ genes cluster at topologically associating domain (TAD) boundaries, and that an evolutionarily recent rearrangement at the dachshund (DAC) locus may have been instrumental in the emergence of the symbiotic light organ in Euprymna - directly linking specific chromosomal topology to morphological innovation [44].

      To understand the broader functional impact of these changes across coleoids, a recent study investigating Micro-C, RNA-seq, and ATAC-seq data from multiple species revealed broadly conserved chromatin domains, but also many lineage-specific chromatin loops that form novel regulatory signatures and impact expression profiles across species and tissues [149].

      Despite the observed small-scale regulatory changes, the chromosomes of decapods are considered to be more closely related to the ancestral coleoid karyotype than those of octopods. The derived octopod karyotype becomes apparent when comparing it to the genome of the vampire squid, an early-branching octopodiform (sister to all octopods) which retained features of the decapod, ancestral karyotype [150]. Taken together, the conserved karyotype of decapods accommodates fine-scale regulatory diversity that might underlie morphological diversity among species, which suggests that many regulatory innovations are still being evolutionarily explored through rearrangements within the existing chromosomes.”

      (4) Underdeveloped gene family and pathway analysis

      While the authors identify expansions in gene families such as protocadherins and C2H2 zinc finger transcription factors, the functional significance of these expansions remains speculative. The manuscript would benefit from:

      (a) Functional enrichment analyses (e.g., GO, KEGG) targeting these gene families.

      (b) Expression profiling across tissues or developmental stages to infer regulatory roles.

      (c) Comparison with expression or expansion patterns in other cephalopods with known behavioral complexity (e.g., Octopus bimaculoides, Euprymna scolopes).

      (d) Potential integration of transcriptomic or epigenomic data to support regulatory hypotheses.

      We thank the reviewer for these constructive suggestions and have substantially expanded the functional characterization of expanded gene families in the revised manuscript.

      To address points (a) and (b), we performed GO enrichment analyses for all expanded gene families (orthogroups), both for the largest gene families and the most significantly expanded families identified from our CAFE5 analysis. Further, we cross-referenced all S. officinalis members of each expanded orthogroup against differentially expressed genes in our bulk RNA-seq data from multiple tissues (initially collected to improve the gene modeling), allowing us to infer tissue-specific expression patterns for the expanded families.

      To address point (c), the species-resolved copy-number profiles from our orthogroup analysis directly situate the S. officinalis expansions within the broader coleoid context, including O. bimaculoides, O. vulgaris, E. scolopes, and D. pealeii, enabling direct comparison of expansion scale and lineage specificity across species with varying degrees of behavioural complexity. We note that the C2H2 zinc finger and protocadherin expansions show distinct phylogenetic profiles consistent with independent radiations in octopods and decapodiforms, in agreement with recent studies.

      Regarding point (d), no epigenomic data for S. officinalis was publicly available at the time of writing, thus we focused on the transcriptomic data from this study, as described above.

      We describe this analysis in two additional results paragraphs in the manuscript, accompanied by one modified figure (Figure 4) and two new figures (Figure 5 and Supplementary Figure 7). The paragraphs are reproduced here (lines 294-400):

      “Analysis of expanded gene families

      We sought to investigate the S. officinalis gene annotation and place it in the context of gene repertoires from other cephalopod or molluscan species. First, we collected available genome annotations from 12 other molluscan species (Table 2) and clustered them using OrthoFinder v.3.1.0 [122], resulting in 23,658 orthogroups, hereafter named gene families.

      First, we investigated 36 of the gene families that contain more than 100 genes in any of the species, with 17 of these families containing at least one gene of S. officinalis, that reflect large-scale gene family expansions (Figure 4E). We used the InterProScan and eggNOG-mapper annotations to infer functional roles of these genes, selecting the most common gene annotation as the name of the gene family.

      The zinc finger C2H2-type transcription factors (TFs) were grouped into three of the large gene families, with the largest family (OG0000000) only present in decapod cephalopods. This likely reflects the largely independent expansions in the octopod and decapod lineages that date back to a burst of transposon activity ca. 25 million years ago [46,48,49]. The largest expansion across mollusks occurs in the cadherin-like family (OG0000001): 310 in S. officinalis, 283 in D. pealeii, 209 in A. lycidas, 102 in O. vulgaris, 55 in O. bimaculoides, with low but non-zero counts in bivalves (C. virginica, M. gigas). This profile is consistent with the protocadherin expansion first described in O. bimaculoides [46] and subsequently shown to be present across cephalopods [48,49,123].

      HPGDS (OG0000005, hematopoietic prostaglandin D synthase) is a glutathione-S-transferase family member that catalyzes the conversion of prostaglandins, which have well-described roles in immune responses in vertebrates and insects [124,125]. This family shows a broad expansion in decapods, with a lesser expansion in octopods. Additionally, members of the glutathione-S-transferase families have been co-opted as S-crystallins, structural proteins found in the lens of cephalopods that may, or may not, retain enzymatic functions [126,127].

      Two large families are mostly lineage-restricted. The RING-type zinc finger family (OG0000058) has 103 copies in S. officinalis and 26 in A. lycidas but is absent in all other species except for E. scolopes. Conversely, OG0000002 (unknown function) has 479 copies in E. scolopes and only a few copies in the other species. This interesting Sepiolid-specific expansion warrants further characterization.

      We estimated gene family evolution rates using CAFE5 [128] for all families with fewer than 100 copies in any species (this excludes the families described above, as very large copy-number differences between species preclude likelihood calculations under the applied birth-death model). After comparing different model parameters, we chose a gamma model with three rate categories, allowing for evolutionary rate variation among gene families. Out of the 12,895 gene families analyzed, 1,813 showed a significant (p < 0.05) expansion or contraction in at least one of the species. We focused our analysis on the 30 most significantly expanded families; among them were several retrotransposon-associated domains that have expanded specifically in S. officinalis: five families carrying Retrovirus-related Pol polyprotein domains, two Reverse transcriptase domain families, and four Ribonuclease H-like families (Supplementary Figure 7A). There was no coordinate-based overlap of the coding sequences with annotated TEs from the RepeatMasker output (Methods).

      In addition to the three large gene families of C2H2 zinc finger expansions, 45 gene families containing this TF type showed a significant change in the CAFE5 analysis. Notably, eight of the significant gene families, as well as four of the largest gene families, were annotated as CCHC-type zinc fingers, which contain a “zinc knuckle” motif that is characteristic of retroviral nucleocapsid proteins [129] and is functionally integrated in the genomes of several species, including humans [130].

      Some gene families without any relationship to retrotransposons were also expanded. For example, the UGT2A1-related family is a UDP-glucuronosyltransferase, a class of enzymes central to phase II detoxification and conjugation of metabolites, reported in other mollusks in the context of environmental chemical tolerance [131], and in insects in the context of pigmentation [132]. We also detected a family of homeodomain-like proteins, representing an expansion of this important TF family.

      Tissue-specific expression of expanded gene families

      To place the identified gene families in a functional context, we profiled their expression in the bulk RNA-seq data (taken from multiple tissues of S. officinalis) used originally for gene modeling (Figure 5A). Principal component analysis (PCA) revealed the largest axis of variation in gene expression to separate brain tissues from peripheral tissues, with skin being the most transcriptomically distinct (Figure 5A), consistent with the high number of tissue-specific differentially expressed (DE) genes identified in non-neural tissues (Figure 5B). We identified the genes belonging to expanded families that were differentially expressed across tissues and enriched gene ontology [133,134] (GO) terms for them to gain additional insight. The large families excluded from CAFE5 modelling and the significantly expanded families identified by CAFE5 were analyzed separately.

      Eleven of the largest gene families were expressed in our data (Figure 5C) and five had enriched GO terms (Figure 5D,E). Among them, the cadherin family showed brain-restricted expression and GO terms related to cell–cell adhesion and calcium binding, consistent with their role in neuronal connectivity and circuit formation [46,135]. Two C2H2 zinc finger gene families were expressed in the optic and vertical/subvertical lobes of the brain and in the skin, with GO terms related to DNA-binding, transcriptional regulation or development. The RING-type zinc finger family was expressed specifically in the skin, with GO terms including zinc binding and ubiquitin protein ligase activity, the canonical function of RING-domain E3 ligases [136]. Genes of the HPGDS/S-crystallin family were expressed in the brain (basal and optic lobes and posterior subesophageal mass) and skin, with GO terms related to glutathione metabolism, matching their described enzymatic function. We did not find expression in the retina, which is expected given that S-crystallins are expressed in lentigenic cells of the eye [42,137] and these cells were not included during sampling.

      Among the 30 most significantly expanded families examined (out of 1,813 total), expression was widespread (20/30) and tissue-specific differential expression was common (17/30), suggesting that a substantial proportion of expanded paralogs represent functional coding sequences with specialized spatial deployment (Supplementary Figure 7B). Ten of the retrotransposon-associated families were differentially expressed in the brain (optic and vertical/subvertical lobes) and skin, arguing against these loci being inactive repeat fragments and supporting their inclusion as transcribed gene models. Two significantly expanded families showed both differential expression and enriched GO terms (Supplementary Figure 7C). The first was the UGT2A1-related family, which had the largest number of differentially expressed genes overall, with expression concentrated in the skin, retina and posterior subesophageal mass of the brain. Enriched GO terms matched the described enzymatic function for this family, namely UDP-glycosyltransferase activity. The second gene family was the homeodomain-like family with enrichment for DNA binding terms consistent with their role as transcription factors, and was preferentially expressed in the vertical and subvertical brain lobes with weaker expression in other areas.

      Collectively, many differentially expressed genes from expanded families were restricted to specific tissues or brain subregions (Figure 5F and Supplementary Figure 7D), indicating that paralogs within an expanded family have adopted distinct spatial expression domains and possibly, specialized functions.”
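
      The split between the very large families and those passed to CAFE5 can be sketched as a simple partition of the orthogroup count table. This is a minimal illustration, not the authors' code: the function name and the two toy orthogroups are hypothetical, the copy numbers for OG0000001 and OG0000058 are the ones quoted in the text, and the cutoff follows the methods wording (100 or more copies in any species set aside, single-species orthogroups removed).

```python
def partition_orthogroups(counts, large_threshold=100):
    """Split orthogroups into (modelled, large) sets.

    counts maps orthogroup ID -> {species: copy number}.
    Orthogroups found in only one species are dropped; orthogroups with
    >= large_threshold copies in any species are set aside as "large".
    """
    modelled, large = {}, {}
    for og_id, per_species in counts.items():
        n_species = sum(1 for copies in per_species.values() if copies > 0)
        if n_species <= 1:
            continue  # single-species orthogroup: removed before modelling
        if max(per_species.values()) >= large_threshold:
            large[og_id] = per_species
        else:
            modelled[og_id] = per_species
    return modelled, large


counts = {
    # Copy numbers quoted in the text (subset of species shown).
    "OG0000001": {"S. officinalis": 310, "D. pealeii": 283, "O. vulgaris": 102},
    "OG0000058": {"S. officinalis": 103, "A. lycidas": 26},
    # Hypothetical toy entries for illustration only.
    "OG_toy_small": {"S. officinalis": 12, "D. pealeii": 7},
    "OG_toy_single": {"S. officinalis": 5, "D. pealeii": 0},
}

modelled, large = partition_orthogroups(counts)
```

      In this toy table, the two quoted families land in the large set, one toy family is kept for rate modelling, and the single-species family is discarded.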

      Reviewer 2 (Public review):

      Summary:

      This paper concerns an interesting organism, Sepia officinalis. However, in the opinion of this reviewer, the paper reads somewhat like a genome report. The authors have used 23x PacBio HiFi in conjunction with relatively low coverage (11x) Hi-C to scaffold the genome into a karyotype of 47 chromosomes. They have used a combination of short- and long-read RNA-seq to annotate the genome in what looks like a very good annotation. The paper offers basic analyses of the BUSCO evaluation, some descriptive analyses of gene family and repeat content, and a somewhat more focused analysis of synteny among sequenced squids. Generally, the data will be useful.

      Strengths:

      This is a high-quality annotation, and the data ultimately will be useful to other researchers. I appreciate trying to understand what's happening between assemblies of S. officinalis.

      Weaknesses:

      I don't believe the data at hand makes a strong case for the argument of 47 chromosomes. This is my biggest sticking point with the paper, and it is for a few reasons:

      (1) The authors point to assembly differences between the DToL assembly and the one presented in the manuscript and seem to claim that DToL is incorrect. However, the DToL assembly (xcSepOffi3.1) is based on much deeper HiFi and HiC coverage than the one at hand (51x and 80+x respectively). There are many things to try here, including:

      (a) Downloading the DToL data and reassembling using a common pipeline.

      (b) Downsampling the DToL data to similar coverage as what the authors have achieved.

      (c) Combining your data and that of DToL for even deeper coverage (heterozygosity is low enough that I don't imagine this impeding things too badly).

      We thank the reviewer for these helpful suggestions and want to clarify that we did not seek to point out errors in the DToL assembly, but rather to investigate the unexpected discrepancies between the two assemblies. It is correct that the DToL data has a much higher coverage than our data. We followed the individual suggestions and incorporated them into the revised manuscript. We reproduce the relevant sections below, and provide additional information:

      (a) Downloading the DToL data and reassembling using a common pipeline.

      We downloaded the DToL data and reassembled it using a common pipeline, yielding the results listed in Author response table 1. The DToL assembly is more contiguous, which is mainly due to its higher HiFi coverage. It also receives slightly better BUSCO scores (computed using odb12 as recommended by Reviewer 3).

      Author response table 1.

      Full statistics of S. officinalis assemblies from two independent datasets, assembled using a common pipeline.

      The updated manuscript now reads (lines 146-159):

      “A chromosome-scale assembly for Sepia officinalis was released recently by the Wellcome Sanger Institute’s Darwin Tree of Life project [75] (DToL, GCA_964300435.1). That genome was assembled from a male individual using high coverage PacBio Sequel II (~51x) and Arima2 Hi-C (~80x) data, with a final assembly size of 5.8 Gb. The haploid chromosome number was estimated to be 49. To compare both S. officinalis datasets directly, we downloaded the DToL data and created two new assemblies using the pipeline described above (hifiasm using PacBio HiFi and Hi-C data). The resulting assemblies were overall very similar, with the DToL assembly having a slightly higher contiguity (N50 length, see Table 1) and BUSCO completeness (Supplementary Figure 2A,B) due to its higher sequencing coverage.”

      To further compare the two datasets, we added a new Figure 2 to the revised manuscript and the following paragraph to the results (lines 160-169):

      “After scaffolding with YAHS, both datasets reached the previously identified chromosome numbers (1n=47 for MPIBR and 1n=49 for DToL, Figure 2A,B). To further investigate this surprising discrepancy, we aligned both assemblies using Winnowmap [89] to locate the differences between them (Figure 2C). We observed four “breakpoints” (BP) of chromosome scaffolds: one in the MPIBR assembly compared to DToL (BP1: DToL_5 = MPIBR_40+44) and three in the DToL assembly compared to MPIBR (BP2: DToL_31+40 = MPIBR_2, BP3: DToL_41+46 = MPIBR_6, BP4: DToL_44+45 = MPIBR_7). We also aligned the assemblies to the chromosome-scale genome of another cuttlefish Acanthosepion esculentum (1n=46, GCA_964036315.1). In this alignment, all four breakpoints were collinear with single A. esculentum chromosomes (Figure 2D).”
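
      The bookkeeping behind the four breakpoints can be made explicit with a short what-if calculation. This is only a sketch under a stated assumption: that every breakpoint is a technical split of one chromosome, which the response suggests but does not assert as a final karyotype. The variable names are illustrative; the scaffold labels and reported counts are taken from the quoted paragraph.

```python
# Chromosome-scale scaffold counts reported after YAHS scaffolding.
reported = {"MPIBR": 47, "DToL": 49}

# Scaffold pairs that align to a single scaffold in the other assembly.
split_pairs = {
    "MPIBR": [("MPIBR_40", "MPIBR_44")],  # BP1: DToL_5 = MPIBR_40+44
    "DToL": [
        ("DToL_31", "DToL_40"),  # BP2: = MPIBR_2
        ("DToL_41", "DToL_46"),  # BP3: = MPIBR_6
        ("DToL_44", "DToL_45"),  # BP4: = MPIBR_7
    ],
}

# ASSUMPTION: each pair is one chromosome split in two, so joining a pair
# lowers that assembly's scaffold count by one.
merged_counts = {
    name: reported[name] - len(split_pairs[name]) for name in reported
}
```

      Under that assumption both assemblies converge on the same scaffold count, equal to the 1n=46 reported for A. esculentum, which is consistent with all four breakpoints being collinear with single A. esculentum chromosomes.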

      (b) Downsampling the DToL data to similar coverage as what the authors have achieved.

      Instead of downsampling the DToL data, we decided to analyze the Hi-C and HiFi data for both assemblies, focusing on the four “breakpoints” between the assemblies and the A. esculentum genome that we described above. First, we performed a QC analysis of the Hi-C reads using pairtools [2]; the result is visualized in Author response image 1. We computed the percentage of valid Hi-C read pairs, i.e., trans pairs and cis pairs with insert distances of more than 1 kb, following the Dovetail Genomics QC manual (https://dovetail-analysis.readthedocs.io/en/latest/whole_genome/qc.html). When Hi-C pairs were aligned to the primary contigs from hifiasm (as used for scaffolding with YAHS), the DToL Hi-C data contained fewer valid read pairs (11.4%) than the MPIBR data (43.1%), possibly due to the use of a different tissue (eye vs. optic lobe) and Hi-C kit (Arima 2 vs. Dovetail OmniC) for library preparation. Nonetheless, owing to the much higher overall coverage, the number of valid read pairs is still 2.35x higher for DToL (144,014,368 pairs) than for MPIBR (61,318,955 pairs). The trans fraction (i.e., Hi-C pairs across contigs) depends on the length of the primary contigs, so the higher trans fraction for the MPIBR data can be explained by the lower contiguity of its primary contigs. It is conceivable that for both assemblies, the low numbers of valid read pairs introduce a technical fragmentation of certain chromosomes, as indicated by the identified breakpoints (Figure 2).

      Author response image 1.

      Analysis of Hi-C read pairs from both S. officinalis assemblies. Hi-C reads were aligned to the primary contigs from hifiasm (as is used for scaffolding with YAHS) and analyzed using pairtools. Note the higher fraction of long-range contacts (at least 1 kb cis pairs or trans pairs) in the MPIBR data (top) compared to DToL (bottom). Due to overall higher coverage, the absolute number of read pairs is higher for DToL than for MPIBR data.
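
      The valid-pair criterion described above (trans pairs plus cis pairs separated by at least 1 kb) can be sketched in a few lines. This is an illustration with hypothetical names and toy read pairs, not the pairtools implementation; the 1 kb cutoff is treated as inclusive here, which is an assumption, since the text says both "more than" and "at least" 1 kb. The final ratio uses the two absolute counts quoted in the response.

```python
from dataclasses import dataclass

MIN_CIS_DISTANCE = 1_000  # bp; assumed inclusive cutoff for long-range cis pairs


@dataclass(frozen=True)
class Pair:
    contig1: str
    pos1: int
    contig2: str
    pos2: int


def is_valid(pair, min_cis_distance=MIN_CIS_DISTANCE):
    """A pair counts as valid if it is trans or long-range cis."""
    if pair.contig1 != pair.contig2:
        return True  # trans pair: the two reads map to different contigs
    return abs(pair.pos2 - pair.pos1) >= min_cis_distance


def valid_fraction(pairs):
    """Fraction of pairs that pass the validity criterion."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(is_valid(p) for p in pairs) / len(pairs)


toy_pairs = [
    Pair("ctg1", 100, "ctg1", 400),      # short-range cis: not counted
    Pair("ctg1", 100, "ctg1", 5_100),    # long-range cis: counted
    Pair("ctg1", 100, "ctg2", 900),      # trans: counted
    Pair("ctg2", 2_000, "ctg2", 2_500),  # short-range cis: not counted
]
toy_fraction = valid_fraction(toy_pairs)

# Absolute valid-pair counts quoted in the response.
dtol_valid, mpibr_valid = 144_014_368, 61_318_955
fold_difference = dtol_valid / mpibr_valid
```

      Half of the toy pairs pass the filter, and the ratio of the quoted counts rounds to the 2.35x stated in the response.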

      Second, we performed a detailed analysis of read coverage along the breakpoint junctions of the discrepant chromosomes/scaffolds between both assemblies. We included a description of the results and a new Supplementary Figure 3 in the manuscript (lines 171-207):

      “To better understand the potential cause of these divergent chromosome numbers, we analyzed the Hi-C and HiFi coverage in the breakpoint regions (Supplementary Figure 3A). First, we aligned the HiFi reads to the scaffolds and extracted all alignments along the 200 kb terminal scaffold windows to find any notable drops in coverage, or reads spanning any of the scaffold junctions. We detected no spanning reads. This is not surprising given that no contigs were assembled at these sites, resulting in the observed scaffold junctions. More interestingly, we noted a ~5-fold decrease in HiFi coverage along the DToL scaffold_40 (part of BP2) relative to its flanking regions, indicating a highly repetitive, low-mappability region at this boundary.

      Next, we realigned the Hi-C data to the scaffolded assemblies using bwa-mem2 [91] and extracted all trans HiC pairs (between-scaffold contacts) using pairtools [92]. We normalized trans HiC contacts to the scaffold length and compared contact rates between breakpoint scaffolds to the baseline contact rate (computed from pairs of scaffolds with a clear 1-to-1 match between assemblies), and the contact rate within scaffolds (intra-scaffold pairs) (Supplementary Figure 3B,C). The contact rates within breakpoints were consistently lower than within scaffolds, likely falling below the threshold to be merged during assembly. However, the contact rates at three of four breakpoints (BP1, BP3, BP4) were significantly elevated above the genome-wide background distribution (empirical p = 0.010, 0.005, 0.005 respectively), suggesting that they may represent intra-chromosomal contacts disrupted by a misassembly. Notably, BP2 was not significant (empirical p = 0.170), likely due to the low coverage and mappability around the DToL scaffold_40 boundary. Considered jointly, the three DToL breakpoint scaffold pairs showed significantly higher trans contact rates than the background (Wilcoxon rank-sum, one-tailed, U = 1771, p = 0.004).

      Lastly, we analyzed the repeat landscape around the 200 kb scaffold ends using RepeatMasker [93] and the custom repeat library that we had generated for Sepia officinalis (described further below). Compared to control scaffolds of the same assembly, we observed consistently elevated repeat content at the breakpoint junctions (mean 71.5% vs 67.6% masked bases), with an enrichment of unclassified repeats (32.1% vs 30.0%), which could explain a repeat-driven assembly fragmentation or scaffolding failure. The BP2 DToL scaffold_40 junction window was 99.99% masked (99.2% unclassified repeats), providing a likely mechanistic explanation for both the HiFi coverage drop and the absence of a significant trans Hi-C signal at this breakpoint. Taken together, these analyses suggest that the different chromosome numbers across the two S. officinalis assemblies are due to technical reasons, caused by repeat-rich scaffold boundaries that impair HiFi and Hi-C read alignment and in turn, correct assembly in these regions.”
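
      The comparison of breakpoint contact rates against a genome-wide background can be sketched as a one-sided empirical p-value. This is a hedged illustration only: the function names are hypothetical, the normalization (contacts divided by the product of the two scaffold lengths) and the +1 correction are assumptions, since the excerpt only states that contacts were normalized to scaffold length, and all numbers are toy values rather than the manuscript's data.

```python
def contact_rate(n_contacts, len_a, len_b):
    """Trans contacts normalized by scaffold lengths (ASSUMED normalization)."""
    return n_contacts / (len_a * len_b)


def empirical_p(observed, background):
    """One-sided empirical p-value.

    Share of background rates that are at least as large as the observed
    rate, with a +1 correction so the estimate is never exactly zero.
    """
    exceed = sum(1 for rate in background if rate >= observed)
    return (exceed + 1) / (len(background) + 1)


# Toy background: 199 scaffold pairs with a clear 1-to-1 match (hypothetical).
background = [contact_rate(n, 50_000_000, 60_000_000) for n in range(100, 299)]

# Toy breakpoint pair with more trans contacts than any background pair.
observed = contact_rate(400, 50_000_000, 60_000_000)

p_value = empirical_p(observed, background)
```

      With 199 background pairs and none exceeding the observed rate, the smallest attainable value is 1/200, which shows how the resolution of an empirical p-value is bounded by the size of the background set.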

      (c) Combining your data and that of DToL for even deeper coverage (heterozygosity is low enough that I don't imagine this impeding things too badly).

      When combining the data to achieve a higher coverage, we ran into the assembly fragmentation issues detailed above in response 1) to Reviewer 1.

      (2) Looking at Figure 1, there appears to be a misjoin at chromosome 42. Looking carefully at Figure S1, that misjoin does not appear on any of the panels - this is confusing. Given the size of that chromosome and the authors' chromosome numbering, I'm guessing this is a manual merge (as it's larger than most of the chromosomes numerically close to it: 40, 41, 43, etc.). Further, staring closely at Figure 1, there appear to be cross-scaffold contacts between 42 and 43 and 42 and 44. Secondarily there are contacts between 43 and 44. This bit of the assembly seems potentially problematic.

      This is a great observation: the Hi-C maps indeed differ between Figure 1 and Figure S1. Figure 1 is the result of scaffolding with YAHS and manual curation, whereas Figure S1 was scaffolded using HapHiC. We updated the figure legend to clarify this important difference. HapHiC produces very clean contact maps without the need for manual curation, but inspection at higher resolution showed that the tool broke many contigs and ultimately compromised the assembly quality, possibly due to our comparatively low Hi-C coverage. We therefore preferred YAHS with manual curation, which is admittedly error-prone, as is apparent in the regions of the assembly pointed out by the reviewer.

      Reviewer 3 (Public review):

      Summary:

      In this study, authors Simone Rencken and co-authors present and investigate the genome of the common cuttlefish Sepia officinalis.

      Strengths:

      The authors explain in a detailed yet concise manner the main steps for a genome assembly, with very robust methods for validation, and according to current best practices. In addition to the chromosomal assembly, the authors confirmed the presence of 47 chromosomes using Hi-C data and multiple species synteny. They also generated a comprehensive gene annotation, with assessments of gene completeness, providing a useful resource for the community of researchers interested in cuttlefish biology and comparative genomics.

      Weaknesses:

      While the study touches upon the subjects of gene content, TE activity, or species-level comparisons, the study does not provide in-depth investigations of these.

      We thank the reviewer for their positive assessment of our manuscript. We acknowledge the descriptive nature and limitations of our previous analyses of gene content, TE distribution, and species comparisons. Our focus for the initial submission was to provide a high-quality assembly that could serve as a resource for anyone interested in Sepia officinalis or related species. However, we agree that greater insight into genome content is valuable as well. In the revised manuscript, we included a more detailed analysis of expanded gene families and a GO enrichment analysis of our bulk RNA-seq data, which we summarized in response 4) to Reviewer 1.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      Minor Revisions Recommended:

      (1) Figure and legend clarity

      Several figures lack sufficient annotation. All figures, including supplementary ones, should include:

      (a) Clear axis labels.

      (b) Descriptions of statistical measures (n values, error bars, statistical tests).

      (c) Legends that allow the figure to be understood independently of the main text.

      We updated the figures accordingly.

      (2) Terminology and formatting

      (a) Consistency in gene and species nomenclature should be maintained throughout (e.g., italicizing gene names and Latin binomials).

      (b) Ensure that abbreviations (e.g., Hi-C, BUSCO, FISH) are defined upon first use.

      We updated the nomenclature throughout the text and checked the definition of abbreviations used in the text. Further, we updated the names of several cuttlefish species according to the recent revision of genera, e.g. Sepia esculenta was changed to Acanthosepion esculentum [3].

      (3) Literature coverage

      The references primarily focus on earlier studies from 2010-2020. It would strengthen the context to include recent high-impact studies on cephalopod genomics and chromosomal biology published in the last 3 years (e.g., 2022-2024).

      We apologize for this oversight and have extended the manuscript to discuss more of these recent studies.

      (4) Clarify methods

      While the methods section is generally detailed, some critical aspects are underspecified:

      (a) Parameters used in genome annotation tools (e.g., BRAKER, RepeatMasker).

      We thank the reviewer for bringing our attention to this shortcoming, and have added the missing parameters to the methods section. Additionally, the full code is available at https://gitlab.mpcdf.mpg.de/mpibr/laur/cuttlefishomics/soffgenome

      (b) Criteria for ortholog clustering and gene family expansion analysis.

      The details have been added to the methods section, which now reads (lines 828-853):

      “Orthogroups were inferred across 13 molluscan species (Table 2), including S. officinalis, using OrthoFinder v3.1.0 [122] with default parameters. The input proteomes included the longest protein isoform per gene for each species. The rooted species tree from OrthoFinder [182,184] was converted to an ultrametric tree using the R package ape [183] v5.8.1.

      Gene families were filtered by removing orthogroups present in only a single species, and by separating orthogroups containing 100 or more gene copies in any species, as extreme copy-number differences in gene families prevent likelihood calculation under the applied birth-death model.

      Gene family evolution rates were estimated using CAFE5 [128] v5.1.1 on the filtered orthogroups, using the ultrametric species tree as input. Four models were evaluated: the base model (single global lambda), and Gamma models with k = 2, 3, and 4 rate categories, which allow evolutionary rate variation among gene families. The Gamma k = 3 model was selected based on the best (lowest) final negative log-likelihood (-lnL) score. All subsequent statistical inferences were performed under this model.

      For families showing statistically significant expansion or contraction (p < 0.05 after Bonferroni correction), branch-specific copy-number changes were extracted from the CAFE5 output. Families were categorized as S. officinalis-specific, coleoid-specific, or broad expansions based on the distribution of significant changes across the phylogeny.

      To assess whether expanded gene families in S. officinalis contained genes derived from or embedded within repetitive elements, a coordinate-based overlap analysis was performed. For each gene in an expanded orthogroup, the overlap between its coding sequence (CDS) coordinates and RepeatMasker annotations was computed using bedtools intersect v2.30 [185]. To avoid double-counting when multiple repeat annotations overlapped the same coding bases, overlapping repeat intervals were merged per gene prior to summing covered bases, and the overlap fraction was computed as merged covered bases divided by total CDS length.”
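
      The merge-before-summing step in the overlap calculation can be illustrated with a small self-contained sketch. This is not the authors' bedtools workflow: the function names are hypothetical, coordinates are assumed to be half-open (BED-style) on a single sequence, and CDS exons are assumed not to overlap each other.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(start, end) for start, end in merged]


def overlap_fraction(cds_exons, repeats):
    """Fraction of CDS bases covered by at least one repeat annotation."""
    cds_length = sum(end - start for start, end in cds_exons)
    if cds_length == 0:
        return 0.0
    # Clip every repeat to every CDS exon it intersects.
    clipped = []
    for c_start, c_end in cds_exons:
        for r_start, r_end in repeats:
            start, end = max(c_start, r_start), min(c_end, r_end)
            if start < end:
                clipped.append((start, end))
    # Merging first ensures bases under several repeats are counted once.
    covered = sum(end - start for start, end in merge_intervals(clipped))
    return covered / cds_length


# Toy gene: two CDS exons (200 bp total) and three repeat annotations,
# two of which overlap each other inside the first exon.
cds_exons = [(100, 200), (300, 400)]
repeats = [(150, 250), (180, 320), (390, 500)]

fraction = overlap_fraction(cds_exons, repeats)
naive = sum(
    max(0, min(c_end, r_end) - max(c_start, r_start))
    for c_start, c_end in cds_exons
    for r_start, r_end in repeats
) / 200
```

      In the toy gene, merging gives 80 covered bases out of 200, whereas summing raw overlaps would count the doubly annotated bases twice and report 100.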

      (c) Thresholds or cutoffs for synteny or duplication detection.

      We included the details in the updated methods (lines 755-781):

      “Synteny analyses between all chromosomes of the compared species were performed using the R package GENESPACE v.1.2.3 [175] with default parameters, described briefly below. Protein sequence similarity was first estimated using DIAMOND2 [109] in fast mode, and orthogroups and pairwise orthologues were inferred using OrthoFinder v2.5 [176] with hierarchical orthogroups (HOGs) enabled. Prior to synteny inference, tandem arrays were condensed to their most central representative gene, and gene rank order was recalculated on these array-representative genes to reduce confounding effects of tandem duplication on collinearity detection.

      Syntenic blocks were identified pairwise between all genome combinations using MCScanX [177], constrained to DIAMOND hits where both query and target genes belonged to the same orthogroup (onlyOgAnchors = TRUE). Initial anchor hits were clustered into large syntenic regions using a density-based spatial clustering approach (dbscan [178]), with a minimum block size of five anchor genes (blkSize = 5) and a maximum of five intervening non-anchor genes permitted within a block (nGaps = 5). Anchor clustering used a search radius of 25 gene-rank positions (blkRadius = 25). All hits falling within a syntenic buffer of 100 gene-rank positions around confirmed block anchors (synBuff = 100) were retained as syntenic. No secondary syntenic hits were included (nSecondaryHits = 0). Syntenic orthogroups were integrated across all pairwise comparisons and collapsed into a pan-genome annotation anchored to S. officinalis, which was used as the reference genome.

      Syntenic relationships were visualized as riparian plots and pairwise dotplots using the built-in plotting functions of GENESPACE v1.2.3. Riparian plots were constructed using physical chromosomal coordinates (useOrder = FALSE) with S. officinalis as the reference, displaying all three genomes. A second riparian plot was generated highlighting a region of interest. Pairwise dotplots were produced for the S. officinalis versus D. pealeii and S. officinalis versus E. scolopes genome comparisons, displaying only synteny-validated hits (type = "syntenic") with a minimum synteny score of 10 (minScore = 10) and a minimum of 10 genes per chromosome pair required for display (minGenes2plot = 10).”

      Reviewer #2 (Recommendations for the authors):

      Line 153 should be supplemental Figure 3B.

      The text was referring to the correct Figure 2B (three species synteny comparison). It is now updated to Figure 3B in the revised manuscript.

      Reviewer #3 (Recommendations for the authors):

      (1) L37: Perhaps add a comparison with other species (mammals, Drosophila, etc.) to put this number in context.

      We agree with this recommendation and added numbers for Drosophila and mouse to the text (lines 40-45):

      “Coleoid cephalopods (octopus, squid, cuttlefish) are a highly derived group of mollusks, characterized by the largest nervous systems among all invertebrates (ca. 500 million neurons in an adult octopus of which 200 million are in the central brain [1,2], compared to ca. 140,000 in the fruit fly [3] or 70 million in the mouse [4]) and specializations with a great historical importance for neuroscience (e.g., “giant axons” [5] and “giant synapses” [6–8]).”

      (2) L51, 279: "Octopodiformes" is a superorder, not a genus or a species name. It should not go in italics.

      We updated this throughout the text.

      (3) L53: "even smaller" seems odd here, because the argument of the sentence is to stress the large genome size of Octopodiformes. Perhaps start the sentence by stating that it is sometimes smaller, but often larger.

      We rephrased the sentence for clarity; it now reads (lines 55-58):

      “While the genomes of Octopodiformes (Octopus, Eledone, Argonauta) are either smaller than (1.1 Gigabases or Gb [45]) or comparable in size to that of humans (around 3 Gb [46,47]) the typical genomes of Decapodiformes (squids and cuttlefish) often reach 6 Gb [48,49].”

      (4) L90: What tool was used to estimate the k-mer distribution of the long reads? Jellyfish? FastK? It's not mentioned anywhere in the text.

      (5) L95: What k-mer size did the authors use to estimate k-mer distribution?

      We thank the reviewer for pointing out this missing information, and have included the details in the methods (lines 692-694):

      “The k-mer distribution was estimated using Meryl [165] within the Merfin [166] package with a k-mer size of 21, and genome size was estimated using GenomeScope [77] from Illumina short reads and PacBio HiFi data.”
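
      For readers unfamiliar with the term, a k-mer distribution is a histogram of how many distinct k-mers occur once, twice, and so on in the reads. The toy Python sketch below illustrates the idea at k = 21. It is not how Meryl works internally, the function names are hypothetical, and counting canonical k-mers (the lexicographically smaller of a k-mer and its reverse complement) is an assumption made for this illustration.

```python
from collections import Counter
from typing import Iterable

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Return the smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def kmer_counts(reads: Iterable[str], k: int = 21) -> Counter:
    """Count canonical k-mers across reads, skipping k-mers with non-ACGT bases."""
    counts: Counter = Counter()
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= set("ACGT"):
                counts[canonical(kmer)] += 1
    return counts


def kmer_spectrum(counts: Counter) -> Counter:
    """Map each multiplicity to the number of distinct k-mers seen that often."""
    return Counter(counts.values())


# Tiny demonstration with k = 3 so the result is easy to check by hand.
demo = kmer_counts(["ACGTA"], k=3)
print(dict(demo))                 # ACG and its reverse complement CGT share one key
print(dict(kmer_spectrum(demo)))
```

      Tools such as GenomeScope fit a model to a histogram of this kind, computed over the full read set, to estimate genome size and heterozygosity.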

      (6) L99: What about using the most recent BUSCO databases? odb12?

      We thank the reviewer for this question, which prompted us to compute BUSCO scores using the more recent odb12 database. The results are shown in Supplementary Figure 2C. Both gene sets have been refined by including more species and using a more stringent filtering approach, so the more recent database contains fewer and more conserved genes [4]. For the mollusca gene sets, a great improvement in completeness was observed between odb10 and odb12 (Supplementary Figure 2C); the metazoan completeness was marginally increased. Therefore, we evaluated all new assemblies produced since the first submission with the odb12 database.

      (7) L107: How many scaffolds were obtained in total? After manual curation, how many of the scaffolds were placed in the "correct" chromosomes? How many scaffolds were in the shrapnel? Were these scaffolds mostly repetitive regions? Or did they contain important genetic information?

      These are important questions. To evaluate the content of the “shrapnel”, we split the manually curated assembly into the 47 chromosomes and the 1840 residual scaffolds, and computed BUSCO scores for both. While the 47 chromosome scaffolds contain the majority of conserved genes: C:92.9%[S:92.7%,D:0.1%],F:4.0%,M:3.1% with metazoa_odb12 and C:88.7%[S:88.0%,D:0.7%],F:4.4%,M:6.9% with mollusca_odb12, the unplaced scaffolds still contain a few BUSCOs: C:2.5%[S:2.4%,D:0.1%],F:2.4%,M:95.1% from metazoa_odb12 and C:1.9%[S:1.7%,D:0.2%],F:1.2%,M:96.9% from mollusca_odb12. Even if only a few BUSCOs are present on these scaffolds, it means they contain important genetic information. Additionally, we observed low, but non-zero alignment of RNA reads to these scaffolds. We observed a slightly elevated repeat content in the unplaced scaffolds (Author response image 2), and a variable base composition (Figure 1C) compared to the chromosome scaffolds.
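
      The BUSCO summary strings quoted above follow a fixed pattern, so they can be parsed and sanity-checked in a few lines. The sketch below uses hypothetical helper names and is not part of BUSCO itself. It assumes the short-summary notation shown in the text and allows a small tolerance for the rounding of each percentage.

```python
import re
from typing import Dict

# Pattern for strings such as "C:92.9%[S:92.7%,D:0.1%],F:4.0%,M:3.1%".
_BUSCO = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%"
)


def parse_busco(summary: str) -> Dict[str, float]:
    """Extract the C, S, D, F and M percentages from a BUSCO summary string."""
    match = _BUSCO.search(summary)
    if match is None:
        raise ValueError(f"not a BUSCO summary: {summary!r}")
    return {key: float(value) for key, value in match.groupdict().items()}


def is_consistent(scores: Dict[str, float], tol: float = 0.15) -> bool:
    """Check C + F + M = 100 and S + D = C, within a rounding tolerance."""
    total_ok = abs(scores["C"] + scores["F"] + scores["M"] - 100.0) <= tol
    complete_ok = abs(scores["S"] + scores["D"] - scores["C"]) <= tol
    return total_ok and complete_ok


chromosomes = parse_busco("C:92.9%[S:92.7%,D:0.1%],F:4.0%,M:3.1%")
unplaced = parse_busco("C:2.5%[S:2.4%,D:0.1%],F:2.4%,M:95.1%")
print(chromosomes["C"], unplaced["C"], is_consistent(chromosomes), is_consistent(unplaced))
```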

      Author response image 2.

      Quantification of repeat content in chromosome scaffolds and unplaced residual scaffolds. Density plot showing fraction of repeat masked bases in total sequence length for chromosome scaffolds (i.e. scaffolds 1-47) in teal and all remaining small scaffolds (1840 scaffolds) in purple. Median repeat fraction is shown as vertical lines.

      The slightly elevated repeat content in the unplaced scaffolds provides a likely explanation for their fragmented state: repeat-rich regions are inherently difficult to assemble and scaffold, as repetitive sequences cause ambiguous read alignments that prevent contigs from being confidently joined or anchored to chromosomal scaffolds during HiC-based scaffolding. This is consistent with the near-complete absence of BUSCO genes from the unplaced scaffolds - not because these fragments lack biologically relevant sequence entirely, as evidenced by the residual BUSCO hits and RNA read alignments, but because the gene-rich portions of the genome are largely captured in the 47 chromosome scaffolds. The unplaced scaffolds instead likely represent fragmented contigs from repetitive or low-complexity genomic regions, such as centromeres, telomeres, and transposable element clusters, where assembly graph complexity and collapsed repeats prevent confident placement. The variable base composition further supports this interpretation, as GC-extreme or low-complexity sequences are disproportionately represented in assembly shrapnel. Together, these observations suggest that the unplaced scaffolds contain limited unique coding content but reflect genuine repeat-rich genomic sequence that cannot currently be placed without additional long-range information, such as optical mapping or ultra-long reads.

      (8) L33, 53, 240, 255, 279: Decapodiformes, not in italics.

      We changed this throughout the text.

      (9) L228: Can you put this expansion in perspective with other taxa?

      We added a more detailed comparison of our gene family expansion with different species to the revised manuscript, as detailed in response 4 to reviewer 1.

      (10) L251: "However, our results show how difficult it still is to assemble large genomes with high karyotype numbers." Can you clarify how your results show this, because it is equally spectacular to assemble the karyotype with only PacBio and Hi-C data (and no linkage mapping).

      Indeed, it is correct that the recent improvements in data quality and scaffolding algorithms enable these “spectacular” chromosome-scale assemblies without the need for linkage mapping. This sentence reflected our expectation to resolve a clear karyotype as has been demonstrated for multiple cephalopod genomes in recent years, including two cuttlefish species (Octopus bimaculoides, Octopus vulgaris, Euprymna scolopes, Euprymna berryi, Acanthosepion lycidas and Acanthosepion esculenta). To our knowledge, none of these publications used linkage mapping or cytogenetic methods to confirm the karyotype. In this light, our resulting chromosome number and the discrepancy to a second assembly of the same species led us to this conclusion. We updated the section in the revised discussion as follows (lines 466-473):

      “Taken together, our results illustrate the difficulty of assembling large genomes with high repeat content and large karyotypes, at least from sequencing data alone. Internal validation methods and genome comparisons across species are therefore important. Convergence of reliable estimates will, in turn, help identify chromosomal fusion-with-mixing events (FWM; fusion of two ancestral chromosomes followed by extensive shuffling of their gene content) that are clade specific. Early branching order in Decapodiformes has been notoriously unstable [53,84,94,144–147]; thus, such rare and irreversible FWM characters could be useful in further phylogenetic analysis of this clade [51,148].”

      (11) L419: Why use the phased haplotype 1 instead of the primary assembly generated by hifiasm?

      We thank the reviewer for this important question. We used the phased haplotype assembly because it provides a biologically coherent representation with the least amount of duplication by avoiding allele-collapsing and haplotype-switching that can be present in the primary assembly. We reasoned that this would result in clearer gene models and a more accurate representation of structural variation. However, we acknowledge that this comes at the cost of reduced contiguity and completeness, as becomes apparent in our BUSCO comparison shown in Supplementary Figure 2, where the phased haplotypes have fewer duplicated genes than the primary assembly, but more missing genes in turn. When reassembling both datasets for our comparison, we used the primary assembly to use the longest contigs as input for scaffolding.

      (12) L444: It is unclear from what tissues and life stages RNA-seq data were used or were available from other species.

      This is an important detail. RNA-seq data was collected from two adult Sepia officinalis, from various tissues (whole brain, retina, skin, mantle, arm, tentacle). For the long-read PacBio Isoseq data, tissue was taken from the animal used for genome sequencing (6 months old), and tissue for short-read Illumina RNA-seq was taken from another adult (8 months old). The data have been released on SRA (study accession SRP570862), where all sample details are listed as well. We added the SRA accession to the data availability section of the revised manuscript. We clarified the relevant sections in the methods:

      lines 628-629:

      “RNA was isolated from various flash-frozen tissues (different brain areas, mantle/epidermis, arm/tentacle; 5-10 mg each).”

      lines 678-680:

      “For short-read RNA sequencing, tissue from another animal (8-month-old adult, F0 from eggs collected in Normandie, France) was used. RNA was isolated from various flash-frozen tissues (different brain areas, skin and retina; 5 mg each).”

      (13) L454, 469: Why is minimap2 in italics? It wasn't formatted like this before. Same for StringTie.

      We thank the reviewer for their detailed review of the methods. In the updated methods section, the formatting of all software names has been harmonized.

      (14) L461: Lophotrochozoa is a clade, not a genus or species. Not in italics.

      This is now changed throughout the revised manuscript.

      (15) Figure 1D: Axes labels are hard to read.

      We have now increased the axis label size.

      (16) Figure 2: Consider increasing font sizes. Many chromosome orientations seem to be flipped across species, which makes it harder to see smaller-scale rearrangements or notice less conserved chromosomes. Would it make sense to standardize these?

      We increased the font sizes and plotted only fully collinear syntenic blocks (instead of aggregated syntenic regions, the default of GENESPACE) for improved readability.

      References:

      Below are references cited in our responses. References from the reproduced manuscript sections are included in the revised manuscript.

      (1) Secomandi, S., Gallo, G.R., Rossi, R., Rodríguez Fernandes, C., Jarvis, E.D., Bonisoli-Alquati, A., Gianfranceschi, L., and Formenti, G. (2025). Pangenome graphs and their applications in biodiversity genomics. Nat. Genet. 57, 13–26. https://doi.org/10.1038/s41588-024-02029-6.

      (2) Open2C, Abdennur, N., Fudenberg, G., Flyamer, I.M., Galitsyna, A.A., Goloborodko, A., Imakaev, M., and Venev, S.V. (2023). Pairtools: from sequencing data to chromosome contacts. Preprint at bioRxiv, https://doi.org/10.1101/2023.02.13.528389.

      (3) Lupše, N., Reid, A., Taite, M., Kubodera, T., and Allcock, A.L. (2023). Cuttlefishes (Cephalopoda, Sepiidae): the bare bones—an hypothesis of relationships. Mar. Biol. 170, 93. https://doi.org/10.1007/s00227-023-04195-3.

      (4) Tegenfeldt, F., Kuznetsov, D., Manni, M., Berkeley, M., Zdobnov, E.M., and Kriventseva, E.V. (2025). OrthoDB and BUSCO update: annotation of orthologs with wider sampling of genomes. Nucleic Acids Res. 53, D516–D522. https://doi.org/10.1093/nar/gkae987.

    1. eLife Assessment

      This study presents valuable evidence of sex differences in oxycodone relapse-related behavior alongside novel characterization of synaptic adaptations in the paraventricular thalamus - nucleus accumbens shell circuit. The authors show that females exhibit heightened cue-induced seeking after 14 days, but not 1 day, of abstinence, while both sexes display similar time-dependent strengthening of paraventricular thalamus - nucleus accumbens shell glutamatergic transmission. The revised manuscript strengthens the work through improved statistical analyses, clearer interpretation, and expanded integration with prior literature. The strength of evidence is solid. However, association among experiments is incomplete, as the sex-specific behavioral effect is not reflected in circuit-level plasticity, and no causal manipulations test pathway involvement in relapse. Future work could link these circuit adaptations to sex-specific relapse vulnerability.

    2. Reviewer #1 (Public review):

      Summary:

      This manuscript by Alonso-Caraballo et al. is a novel piece of work that examines the impact of oxycodone self-administration on neural plasticity within the paraventricular thalamic (PVT) to nucleus accumbens shell (Shell) pathway, two regions each shown to play a key role in cue-induced drug seeking, and whether this plasticity varies based on abstinence period and biological sex.

      Strengths:

      The authors show that a clinically relevant long-access model of opioid self-administration promotes dependence and acute withdrawal in both male and female rats. During subsequent cue-induced relapse tests at 1 or 14 days following the conclusion of self-administration, data show that while both males and females demonstrate drug-seeking behavior at both time points, females show a further elevation in responding on day 14 versus day 1 that is not observed in the males. When accounting for past work showing elevations in drug seeking in males after 30 days, these data indicate that craving-induced relapse for opioids may develop faster and may be more pronounced in females compared to males.

      These behavioral findings were paralleled by use of ex vivo acute slice electrophysiology and circuit-specific ex vivo optogenetics to examine the impact of oxycodone self-administration on synaptic strength within the paraventricular thalamus (PVT) to nucleus accumbens shell (NAcSh) pathway(s). Data support a time-dependent but sex-independent strengthening of glutamatergic signaling at PVT-to-NAcSh medium spiny neurons (MSNs) that is only present following a relapse test at 14 days post abstinence in both males and females, providing the first evidence that opioid self-administration and/or cue-induced drug-seeking augments this pathway. Using an extensive set of physiological measures, the authors show that this increased synaptic strength reflects an upregulation of presynaptic release probability. Further, this upregulation of excitatory signaling aligned temporally with an increase in MSN excitability, as assessed by increases in action potential firing frequency. Finally, the authors provide the first evidence that, similar to other inputs to the NAcSh, PVT projections innervate both MSNs and local interneurons, promoting a GABA-A specific feedforward inhibitory circuit. Interestingly, unlike direct excitatory inputs to MSNs, no changes were observed ostensibly within this feedforward circuit, highlighting a selective enhancement of excitatory drive and output of MSNs with protracted abstinence.

      Overall, these data highlight a potential role for heightened synaptic strength within the PVT-NAcSh pathway in cue-induced relapse behavior during protracted abstinence and identify a potential therapeutic target during abstinence to reduce relapse risk in abstaining individuals.

      Weaknesses:

      Overall, the experimental approach and data provided appear rigorous, support the authors' overall conclusions, and achieve their goal of understanding how opioid self-administration impacts synaptic strength within the PVT-NAcSh pathway. Although not undermining these data, there are a few potential weaknesses that reduce the impact of the work. For example, the inability to directly assess whether cue-induced drug-seeking is in fact augmented compared to daily intake during self-administration in the maintenance phase only permits the authors to denote that re-exposure to cues and the context is sufficient to promote active lever pressing without demonstrating whether seeking behavior is in fact elevated further during a cue test. This is notably understandable as drug-available sessions were 6 hours versus a 1-hour relapse test. Importantly, it is clearly demonstrated that drug seeking is higher on average in female rats after 14 days versus 1 day.

      With regard to interpretation of electrophysiology findings, the lack of inclusion of an abstinence-only group does not permit interpretations to parse out whether observed increases in synaptic strength (or the lack of) reflect abstinence or an interaction between abstinence period and re-exposure to the operant chamber, as slices were taken 30-45 min post relapse test. While much literature has shown that drug-induced adaptations in the NAc require a post-drug period for plasticity to measurably emerge, studies have also shown that re-exposure to heroin-associated cues following abstinence seemingly "reverses" increases in cell excitability in prelimbic-NAc pyramidal neurons (Kokane et al., 2023) and that depotentiation of morphine-induced increases in synaptic strength in the NAc shell can be depotentiated by drug re-exposure -- an effect also observed with cocaine re-exposure (Madayag et al., 2019). Notably, the lack of effect at 14 but not 1 day supports the likelihood that the relapse test does not in fact influence the plasticity within the PVT-NAcSh circuit.

      While the lack of effect on AMPAR:NMDAR ratio and rectification indices do support the notion that enhanced EPSC amplitudes in input-output curves do not reflect a change in AMPAR subunit expression (i.e., increased GluA2-lacking receptors that exhibit inward rectification at depolarized potential) nor a change in postsynaptic sensitivity to glutamate, without direct assessment of AMPAR-specific and NMDAR-specific input-output curves, it doesn't definitively exclude the possibility that both AMPA and NMDA receptor currents are being upregulated, thus negating an observable change in postsynaptic strength.

      Overall, these findings provide novel insight into how the PVT-NAcSh pathway is altered by opioid self-administration and whether this is unique based on abstinence period and sex. Importantly, these were the primary objectives stated by the author. Data highlight a potential role for the observed adaptations in relapse behavior and identify a potential therapeutic target during abstinence to reduce relapse risk in abstaining individuals. However, it should be noted that no causal link is demonstrated without experiments to reduce/prevent relapse.

      Comments on revisions:

      The authors addressed previous concerns brought up, specifically by clarifying data interpretation as well as text modifications related to potential caveats of these interpretations. However, I recommend that the title be changed to not focus on sex differences to avoid misunderstanding. The authors should also address the lack of difference physiologically compared to the behavior as a caveat more clearly in the discussion (i.e. likely suggests this isn't the pathway driving the difference).

    3. Reviewer #2 (Public review):

      Summary:

      This is an interesting paper from Alonso-Caraballo and colleagues that examines the influence of opioid use, acute and prolonged abstinence, and sex on cue-induced relapse and paraventricular thalamus (PVT) to nucleus accumbens shell (NAcSh) medium spiny neurons circuit physiology. The study presents a valuable finding that following prolonged, but not acute abstinence from oxycodone self-administration, female rodents exhibit higher relapse rates to drug paired cues. Additionally, the study presents the useful finding that prolonged abstinence increased PVT-NAcSh MSN synaptic strength in both sexes, an effect that is likely due to presynaptic adaptations. While the evidence to support these two findings is solid, further experiments are required to determine the functional role of the PVT-NAcSh MSN circuit in relapse following prolonged oxycodone abstinence, and the mechanism underlying the heightened relapse vulnerability in females in this model of opioid use disorder.

      Strengths:

      The paper is interesting, well written and presented, and the experiments are well designed and conducted. The revised analysis of spike count data that models the hierarchical structure of the data is appropriate to overcome low animal numbers and the potential for oversampling. The authors are transparent in reporting the results related to this analysis in Figure 5 and acknowledge the study is underpowered to confirm the trend of increased intrinsic excitability in male MSNs following prolonged oxycodone abstinence.

      Weaknesses:

      A major weakness of this study is the disconnect between the behavioral and neurophysiological data reported. While a striking sex difference in relapse-like behavior is observed, there are no statistically significant sex differences in any of the neurophysiological data reported. Moreover, without an experiment to functionally test the role of the PVT-NAc projection in relapse-like behavior following prolonged oxycodone abstinence, these two arms of the study seem divorced.

      While the authors don't directly conclude that the PVT-NAc MSN circuit is required for relapse following prolonged oxycodone abstinence, in the introduction the authors state they aim to test the hypothesis that increased synaptic strength in PVT-NAcSh projections is necessary for drug-seeking. This study does not include the required experiments to test this hypothesis.

      Impact:

      The topic is of interest to the field of substance use disorders and gives solid evidence for the need to consider targeted therapeutics aimed at relapse prevention in opioid use disorder.

    4. Reviewer #3 (Public review):

      Summary:

      Alonso-Caraballo et al. use behavioral testing and ex vivo patch-clamp electrophysiology combined with circuit-specific optogenetic stimulation of PVT terminals to examine how oxycodone self-administration and abstinence duration shape cue-induced relapse and PVT-NAcSh synaptic transmission in male and female rats. In the revision, the authors reanalyzed intrinsic excitability using nested hierarchical GLMMs, acknowledged the low power in the male prolonged-abstinence group, and expanded the discussion of relevant PVT-NAc literature. These changes improve the manuscript. That said, most of the revisions are textual and the main experimental gap remains. Both sexes show increased oxycodone seeking compared to saline at 14 days, but only females show a time-dependent incubation from 1 to 14 days, and the PVT-NAcSh synaptic strengthening is the same in both sexes. Nothing in the revision brings those two observations closer together. The excitability data also come from NAcSh MSNs with no confirmation of PVT connectivity, which limits what circuit-specific conclusions can be drawn. The study is a solid characterization of abstinence-related synaptic changes in this pathway, but some of the conclusions still go further than the data allow.

      Strengths:

      The behavioral characterization is thorough and well-executed, covering self-administration, somatic withdrawal, and cue-induced relapse across two abstinence durations in both sexes. The sex-specific escalation in oxycodone seeking from 1 to 14 days in females but not males is a clear and compelling finding. The use of circuit-specific ex vivo optogenetics to isolate PVT terminal inputs onto NAcSh neurons is a genuine methodological strength, and the demonstration of feedforward inhibitory recruitment through local GABAergic interneurons adds meaningful novelty to the circuit characterization. The reanalysis of intrinsic excitability using nested hierarchical GLMMs appropriately accounts for the non-independence of cells recorded within the same animal and is a real improvement over the original approach. The expanded discussion of prior PVT-NAc work, particularly the more accurate treatment of Keyes et al. (2020) and Paniccia et al. (2024), better situates the findings within the existing literature.

      Weaknesses:

      The core limitation of the study remains unchanged after revision. The PVT-NAcSh synaptic strengthening after prolonged abstinence is statistically indistinguishable between sexes, while females but not males show a time-dependent escalation in oxycodone seeking from 1 to 14 days of abstinence. The Discussion proposes hormonal modulation or differences in upstream inputs as possible explanations, but none of these are tested and the gap is left unresolved. The intrinsic excitability recordings come from NAcSh MSNs with no confirmation that those neurons receive direct PVT input, which was raised in the original review, acknowledged in the revision, and not experimentally addressed. The male prolonged-abstinence excitability trend has approximately 20% statistical power and is non-significant, yet the Discussion interprets it as a potential neuroadaptation that could facilitate signal flow through the PVT-NAcSh circuit and contribute to relapse, which goes well beyond what the data support. The failure to distinguish between D1 and D2 MSNs remains a significant limitation given that cell-type-specific plasticity at PVT-NAc synapses has been shown to be directly relevant to opioid seeking in prior work. Finally, the Conclusion builds a mechanistic framework around D2 MSNs, PV interneurons, and D1 MSNs that is drawn from studies using different drugs or experimental designs, and none of these cell-type-specific mechanisms are tested in the present experiments.

    5. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      (1) Although not undermining these data, there are a few potential weaknesses that reduce the impact of the work. For example, the inability to directly assess whether cue-induced drug-seeking is in fact augmented compared to daily intake during self-administration in the maintenance phase only permits the authors to denote that re-exposure to cues and the context is sufficient to promote active lever pressing without demonstrating whether seeking behavior is in fact elevated further during a cue test. This is notably understandable as drug-available sessions were 6 hours versus a 1-hour relapse test. Importantly, it is clearly demonstrated that drug seeking is higher on average in female rats after 14 days versus 1 day.

      We agree that the current design does not allow us to directly assess whether cue-induced drug-seeking is augmented relative to the average self-administration intake. However, this comparison was not a question examined in the manuscript and was not an intended interpretation of the data. Our analyses and interpretations focused on comparisons between saline and oxycodone groups tested under identical cue-induced relapse conditions. While it does not change or contradict the reviewer’s point, we would also like to clarify that the relapse test was 2 hours long.

      (2) With regard to the interpretation of electrophysiology findings, the lack of inclusion of an abstinence-only group does not permit interpretations to parse out whether observed increases in synaptic strength (or the lack of) reflect abstinence or an interaction between abstinence period and re-exposure to the operant chamber, as slices were taken 30-45 min post relapse test.

      The inclusion of an abstinence-only control group would have been required to definitively dissociate synaptic changes driven by abstinence alone from those arising from an interaction between abstinence and re-exposure to the operant context during the relapse test. In the present study, electrophysiological recordings were intentionally performed 30 to 45 minutes following the relapse test to capture synaptic modifications associated with cue-induced drug-seeking after abstinence. Accordingly, we interpret these findings as reflecting the neural state following relapse rather than abstinence alone, and we have revised the text accordingly to clarify this point.

      (3) With regard to the interpretation of electrophysiology findings, the lack of inclusion of an abstinence-only group does not permit interpretations to parse out whether observed increases in synaptic strength (or the lack of) reflect abstinence or an interaction between abstinence period and re-exposure to the operant chamber, as slices were taken 30-45 min post relapse test. While much literature has shown that drug-induced adaptations in the NAc require a post-drug period for plasticity to measurably emerge, studies have also shown that re-exposure to heroin-associated cues following abstinence seemingly "reverses" increases in cell excitability in prelimbic-NAc pyramidal neurons (Kokane et al., 2023) and that depotentiation of morphine-induced increases in synaptic strength in the NAc shell can be depotentiated by drug re-exposure - an effect also observed with cocaine re-exposure (Madayag et al., 2019). Notably, the lack of effect at 14 but not 1 day supports the likelihood that the relapse test does not in fact influence the plasticity within the PVT-NAcSh circuit.

      We thank the reviewer for highlighting relevant literature showing that drug or cue re-exposure can modify or reverse drug-induced plasticity in NAc-related circuits. We want to clarify that, in our dataset, synaptic changes in the PVT-NAcSh pathway are seen after 14 days of abstinence, but not after 1 day. Therefore, the lack of effect at the earlier time point and its appearance after extended abstinence support the idea of time-dependent plasticity. Although electrophysiological recordings were taken soon after the relapse test, this temporal pattern argues against relapse testing alone as the primary driver of the observed synaptic changes. We have updated the text to clarify this point.

      (4) While the lack of effect on AMPAR:NMDAR ratio and rectification indices do support the notion that enhanced EPSC amplitudes in input-output curves do not reflect a change in AMPAR subunit expression (i.e., increased GluA2-lacking receptors that exhibit inward rectification at depolarized potential) nor a change in postsynaptic sensitivity to glutamate, without direct assessment of AMPAR-specific and NMDAR-specific input-output curves, it doesn't definitively exclude the possibility that both AMPA and NMDA receptor currents are being upregulated, thus negating an observable change in postsynaptic strength.

      We agree that unchanged AMPAR/NMDAR ratios and rectification indices argue against altered AMPAR subunit composition or simple postsynaptic sensitivity changes. Although receptor-specific input-output analyses would be necessary to definitively rule out proportional increases in both AMPA and NMDA receptor currents, we have updated the manuscript to clarify that our conclusions are limited to the synaptic measures we obtained. The revised text now states that acute or prolonged abstinence “might have no detectable postsynaptic effects as assessed by these synaptic measures” at PVT-NAcSh synapses.
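
      The reviewer's arithmetic point can be illustrated with a minimal sketch: if AMPAR and NMDAR currents are scaled by the same factor, their ratio does not move. The amplitudes and the scale factor below are hypothetical values chosen for illustration, not data from the manuscript, and `ampar_nmdar_ratio` is a helper name introduced here.

```python
def ampar_nmdar_ratio(ampa_peak_pa, nmda_peak_pa):
    """Return the AMPAR:NMDAR ratio from two current amplitudes (in pA)."""
    return ampa_peak_pa / nmda_peak_pa


# Hypothetical baseline amplitudes (pA); illustrative only, not measured values.
baseline_ampa, baseline_nmda = 100.0, 50.0

# Assume both receptor currents are upregulated by the same factor.
scale = 1.5

ratio_before = ampar_nmdar_ratio(baseline_ampa, baseline_nmda)
ratio_after = ampar_nmdar_ratio(baseline_ampa * scale, baseline_nmda * scale)

# The ratio is identical even though each absolute current grew by 50%.
print(ratio_before, ratio_after)
```

      This is why an unchanged ratio alone cannot distinguish "no postsynaptic change" from "proportional change in both components"; receptor-specific input-output curves would be needed to separate the two.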

      Reviewer #2 (Public review):

      (5) While this paper is certainly interesting, and well-written, and the experiments seem to be well performed, the behavioral and physiological effects observed are somewhat divorced. Specifically, what accounts for the heightened relapse in females? Since no opioid-related sex differences were observed in PVT-NAcSh neurophysiology, it is unclear how the behavioral and neurophysiological data fit together. Furthermore, the lack of functional manipulation of PVT-NAcSh circuitry leaves one to wonder if this circuit is even important for the behavior that the authors are measuring. I would be more positive about this study if the authors were able to resolve either of the two issues noted above.

      A key challenge in circuit-based studies of motivated behavior is connecting circuit-level plasticity to complex, sex-dependent behavioral phenotypes. In this study, we do not mean to imply that synaptic plasticity within the PVT-NAcSh projection alone explains the increased relapse seen in females. Instead, our electrophysiological data indicate that this projection experiences time-dependent, abstinence-dependent changes in synaptic strength, offering important insights into when and where circuit-level adaptations may occur. We also believe that the lack of obvious sex differences in PVT-NAcSh synaptic strength does not rule out this circuit's role in sex-specific behavior. Growing evidence suggests that sex differences in relapse and motivated behaviors may stem from differential modulation of shared circuits (for example, via ovarian hormones, neuromodulatory tone, or upstream inputs), rather than from significant differences in baseline synaptic properties within a given projection. Regarding circuit relevance, extensive previous research has identified the PVT-NAcSh pathway as a critical regulator of cue-induced reward seeking and relapse. Our findings expand on this by showing that this projection displays abstinence-dependent synaptic strengthening after oxycodone self-administration. Although functional manipulation of this circuit is needed to confirm its causal role, such experiments were beyond the scope of this study.

      (6) There are insufficient animals in some cases. For example, in Figure 4, the Male Saline 14-day abstinence group (n = 3 rats) has less than half of the excitability as compared to the Male Saline 1-day abstinence group (n = 7 rats). This is likely due to variance between animals and, possibly, oversampling. Thus, more rats need to be added to the 14-day abstinence group. Additionally, the range of n neurons/rat should be reported for each experiment to ensure readers that oversampling from single animals is not occurring.

      We appreciate the reviewer's concern regarding the number of animals and the potential for oversampling. We take this concern seriously and have substantially revised our statistical approach in response.

      All spike count data were reanalyzed using nested hierarchical Poisson generalized linear mixed-effects models (GLMMs), fitted separately for each sex and abstinence duration. Each model included injected current (mean-centered), drug condition, and their interaction as fixed effects, with random intercepts and slopes for injected current at the animal level, and random intercepts for cells nested within animals. Importantly, this reanalysis changed several of our original conclusions. Effects that appeared significant under the conventional cell-level analysis were no longer statistically significant once the hierarchical structure of the data was properly modeled. We report these corrected results transparently throughout the revised manuscript.
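
      One way to write down the model described above is sketched here. The symbols are assumptions introduced for illustration (the response does not give notation): i indexes animals, j cells within an animal, and k current steps; y is the spike count, c the mean-centered injected current, and D a drug indicator (0 for saline, 1 for oxycodone). The covariance structure of the animal-level random effects is likewise assumed.

```latex
\begin{align}
y_{ijk} &\sim \mathrm{Poisson}(\lambda_{ijk}) \\
\log \lambda_{ijk} &= \beta_0 + \beta_1 c_k + \beta_2 D_i + \beta_3\, c_k D_i
                      + u_{0i} + u_{1i}\, c_k + v_{j(i)} \\
(u_{0i},\, u_{1i})^{\top} &\sim \mathcal{N}(\mathbf{0},\, \Sigma_u), \qquad
v_{j(i)} \sim \mathcal{N}(0,\, \sigma_v^2)
\end{align}
```

      Here the fixed effects are the current, drug condition, and their interaction; the terms u are the animal-level random intercept and current slope, and v is the random intercept for a cell nested within its animal. Under this reading, one such model would be fitted for each sex and abstinence duration.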

      However, in males after prolonged abstinence, oxycodone-treated animals showed a higher spike output than controls, with a large effect size. Post-hoc analysis showed only 20% power with the current sample (3 saline, 4 oxycodone rats). To reach 80% power, 13 rats per group are needed. We report this as a trend that warrants further study and have revised related sections to reflect this. The data suggest a possible neuroadaptation in males that the study is underpowered to confirm, not a null effect.
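
      For readers unfamiliar with this kind of calculation, the sketch below shows the textbook normal approximation for a two-group comparison at the animal level. It is not the authors' procedure (their power analysis was presumably based on the fitted mixed model), and the function names and the standardized effect size argument are assumptions introduced here.

```python
import math
from statistics import NormalDist


def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Animals per group for a two-sided, two-sample comparison.

    Uses the normal approximation n = 2 * (z_{1-alpha/2} + z_{power})^2 / d^2,
    where d is a standardized (Cohen's d style) effect size.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1.0 - alpha / 2.0)
    z_beta = z.inv_cdf(power)
    return math.ceil(2.0 * (z_alpha + z_beta) ** 2 / effect_size ** 2)


def approx_power(effect_size, n1, n2, alpha=0.05):
    """Approximate power for group sizes n1 and n2 (same normal approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1.0 - alpha / 2.0)
    noncentrality = effect_size * math.sqrt(n1 * n2 / (n1 + n2))
    # The negligible contribution of the opposite tail is ignored.
    return z.cdf(noncentrality - z_alpha)
```

      The normal approximation is optimistic for very small groups such as 3 versus 4 animals; a t-based or simulation-based calculation on the hierarchical model would be the more faithful choice.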

      In response to this comment, we have updated Figure 5, the Results and Discussion sections, and the Statistics/Methods section to clearly describe the nested hierarchical modeling approach, report corrected statistical values, and acknowledge the power limitation for the male prolonged abstinence group. The figure legend now reports the number of neurons recorded per rat, showing the distribution across animals rather than individual subjects.

      (7) The IPSC data, for example in Figure 4, is one of the more novel experiments in the manuscript. However, it is quite challenging to see the difference between males and females, saline and oxycodone, at low stimulation intensities within the graph. Authors should expand this so that reviewers/readers can see those data, especially considering other work suggesting that PVT synaptic input onto select NAc interneurons is disrupted following opioid self-administration. Additional comment: It's also interesting that the IPSC amplitude seems to be maximal at ~2mW of light, whereas ~11 mW is required to evoke maximal EPSC amplitude. It would be interesting to know the authors' thoughts on why this may be.

      While visual separation between conditions at low light levels is subtle, we addressed this directly using linear mixed-effects modeling, which evaluates IPSC amplitudes across the full range of stimulation intensities while accounting for repeated measurements from cells nested within animals. This approach provides greater sensitivity than visual inspection alone and avoids overinterpretation of noise at individual stimulation levels.

      Using this framework, we observed robust main effects of light intensity in both males and females, indicating preserved recruitment of inhibitory synaptic responses as stimulation increased. Importantly, no significant Light × Condition interactions were detected in either sex, indicating that the scaling of IPSC amplitudes with light intensity was not altered by oxycodone exposure.

      With respect to the observation that IPSC amplitudes appear to reach near-maximal levels at lower light intensities (~2 mW) compared to EPSCs (~11 mW), we agree that this distinction is intriguing. One possible explanation is that the IPSCs depend on the recruitment of local interneurons. Because the number of interneurons activated by PVT inputs is limited, inhibitory responses may reach a plateau at relatively low light intensities once these interneurons are fully recruited.

      On the other hand, the increased intensity of photostimulation would result in an increase of monosynaptic EPSC amplitude over a wider range of stimulation (light) intensities, as increased intensity of light would recruit more ChR2-expressing PVT fibers, resulting in larger EPSCs.

      (8) There is an inadequate description of what has been done to date on the PVT-NAc projection regarding opioid withdrawal, seeking, disinhibition, and the effects on synaptic physiology therein. For example, a critical paper, Keyes et al., 2020 Neuron, is not cited. Additionally, Paniccia et al., 2024 Neuron is inaccurately cited and insufficiently described. Both manuscripts should be described in some detail within the introduction, and the findings should be accurately contextualized within the broader circuit within the discussion.

      In the revised manuscript, we expanded the Discussion to give a more thorough overview of previous research on the PVT-NAc pathway in relation to opioid-related behaviors and synaptic changes. Specifically, we added more detail about Keyes et al., 2020 and Paniccia et al., 2024, clarifying their findings and placing them within the context of the circuit mechanisms studied in our work. We also revised the text to ensure the descriptions of these studies are accurate and that their conclusions are properly related to our findings.

      (9) Related to the above, the authors should provide a more comprehensive description of how PVT synapses onto cell-type specific neurons in the NAc which expand beyond MSNs, especially considering that PVT has been shown to influence drug/opioid seeking through the innervation of NAc neurons that are not MSNs. For example, see PMIDs 33947849, 36369508, 28973852, 38141605.

      In the revised manuscript, we expanded the Discussion to describe the diversity of PVT projections within the NAc and the potential role of non-MSN neuronal populations in drug-related behaviors. We added discussion on the broader circuit context and other cell types where relevant to the focus on synaptic transmission onto MSNs. Since our experiments specifically examined synaptic physiology in MSNs, we focused the literature discussion on studies most directly related to MSN-targeted PVT inputs and opioid-related behaviors.

      Reviewer #3 (Public review):

      (10) Additional experiments could strengthen the results and help clarify synaptic mechanisms underpinning behavioral sex differences.

      We agree that additional experiments focused on identifying cell-type-specific mechanisms within the PVT-NAcSh circuit would further enhance understanding of the neural substrates behind the observed behavioral sex differences. In the revised manuscript, we have expanded the Discussion to explicitly acknowledge these limitations and clarify the scope of our current study. Specifically, we discuss the possibility that sex-specific adaptations might occur in particular neuronal subpopulations or circuit components that were not resolved in the present experiments. We also mention that future research using cell-type–specific approaches will be necessary to determine if such mechanisms contribute to the increased oxycodone seeking seen in females after prolonged abstinence. We appreciate the reviewer’s suggestions and have incorporated this perspective into the revised manuscript to better contextualize our findings and outline future directions.

    1. eLife Assessment

      This study investigates the role of the Z-disc protein Zasp52 in Drosophila flight muscles and provides evidence that an intrinsically disordered region (IDR) helps to stabilize and promote the localization of the protein to the Z-disc. Overall, this represents an important study that provides insights into Z-disc function and maintenance. The data are convincing, supported by strong genetic evidence and behavioral tests, well-controlled experiments, and detailed statistical analyses. Additional functional analyses designed to tease out specialized regions within the newly described isoform of Zasp52 would further strengthen models regarding the function of the protein.

    2. Reviewer #1 (Public review):

      The manuscript by Ho and Schock investigates the role of the Z-disc protein Zasp52 during Drosophila flight muscle development. It was known before, mainly by findings from this group, that Zasp52 is required for normal sarcomere morphogenesis, specifically Z-disc morphogenesis in indirect flight muscles. But the exact molecular mechanism by which Zasp52 contributes, apart from the fact that it is localised there and is somehow involved in multimerization/cross-linking, was not clear. This paper proposes that an intrinsically disordered region (IDR) in Zasp52 is needed for some of its functions, by stabilising Zasp52 localisation at the Z-disc. Specifically, the IDR in Zasp52 is proposed to be required for Z-disc maintenance during the mechanical challenges of flight, while being dispensable for the initial morphogenesis during development. This hypothesis is supported by strong genetic evidence and behavioural tests, deleting Zasp's IDR impairs flight from mid-age onwards, while a block in flight activity lifts the phenotype.

      However, some of the phenotypic analysis, in particular the bending of the sarcomere, likely upon mechanical challenge by muscle contractions, needs more detailed investigations to be fully convincing.

      Strengths:

      (1) The linker in the alternatively spliced exon 15 of Zasp52 was deleted with a state-of-the-art genetic editing strategy. Surprisingly, flies are homozygous viable, showing that this long part of the Zasp52 protein is not essential for animal survival or sarcomere morphogenesis.

      (2) The observed sarcomere phenotypes with age, especially the bending Z-discs, are new and exciting.

      (3) The displayed EM images document interesting phenotypes.

      (4) Most of the observed phenotypes can be rescued by re-expression of the long Zasp52 isoform, which does contain the IDR region, but not by a shorter one without it, suggesting that IDR is important.

      (5) FRAP data measure the local turnover of a short-ZaspGFP and show that this increased in the Zasp mutant lacking the IDR domain, suggesting that Zasp-IDR might stabilise Zasp at the Z-disc.

      (6) Interestingly, flight and sarcomere morphology phenotypes can be rescued by preventing the flies from flying, suggesting that they are mechanically induced.

      Weaknesses:

      (1) The western blot quantifications of Zasp isoform expression are weak. No error bars are indicated in the quantifications; the quantifications appear to be more qualitative than quantitative. According to band intensities, the long Zasp isoforms seem to be less present compared to the shorter ones, even in the flight muscles.

      (2) The phenotypic analysis of the sarcomere appears somewhat superficial throughout the paper. Only Zasp52 and phalloidin are shown; no other Z-disc or thick filament proteins. At least myosin stainings and overview images are important to better judge the phenotypic variations. Are the variants between individuals or regional in the same muscle?

      (3) EM images would benefit from better quantification.

      (4) Other proteins were not analysed with the FRAP-based turnover assay for comparison in wild type and mutant. All Z-proteins might turn over faster in the mutant with the defective Z-disc.

    3. Reviewer #2 (Public review):

      Summary and Strengths:

      This in-depth genetic analysis of Zasp52 function in Drosophila indirect flight muscle (IFM) provides an interesting perspective regarding the role of a partially disordered region (IDR) in exon 15e. This exon seems to be exclusively present in IFM and contributes to the prevention of myofibril disintegration during aging, likely due to interactions of this region with Z-disc insertion and/or stability. The addition of an isoform (PR) that lacks exon 15e serves as a nice control to illustrate the necessity of exon 15e in muscle structure and function. Overall, the manuscript is exceptionally well-written, logical, with nicely controlled experiments and detailed statistical analysis that largely support the conclusions drawn by the authors. While exon 15e is clearly involved in preventing muscle degeneration, a solid role for thin filament stability is not clearly shown (as mentioned in the abstract). In addition, which regions/how the proteins of the IDR may contribute are unclear.

      Weaknesses:

      (1) It is not clear in Figure S1A where exon 15e fits within the Zasp52 locus schematic. This is important as a premise of this paper describes this region to be key, and proof from multiple prediction programs would lend more weight to the prediction of the exon being largely disordered. Inclusion of the discussed short linear motifs, comparison with Canoe or LDB3 for similarities and/or an Alphafold structure would help make the authors' point (colorized with known domains).

      (2) Interesting that immobilization rescues the deterioration phenotypes. The authors should explain in more detail how this was done to avoid dehydration/starvation of the flies.

      (3) There is a lot of discussion about the potential function of the IDR region, specifically a putative actin binding motif or other 'ordered' regions that may contain short linear motifs. It would strengthen the findings to show which of these may be essential for Zasp52 function in the IFM. The ability to bind actin could be tested biochemically, and/or smaller deletions could be made to unequivocally test the role of the ABD vs other predicted motifs using genetics. If some of these regions are more ordered, where do they lie within, and do they form a predicted fold or structure that gives insight into function?

    4. Author response:

      Public Reviews:

      Reviewer #1 (Public review):

      The manuscript by Ho and Schock investigates the role of the Z-disc protein Zasp52 during Drosophila flight muscle development. It was known before, mainly by findings from this group, that Zasp52 is required for normal sarcomere morphogenesis, specifically Z-disc morphogenesis in indirect flight muscles. But the exact molecular mechanism by which Zasp52 contributes, apart from the fact that it is localised there and is somehow involved in multimerization/cross-linking, was not clear. This paper proposes that an intrinsically disordered region (IDR) in Zasp52 is needed for some of its functions, by stabilising Zasp52 localisation at the Z-disc. Specifically, the IDR in Zasp52 is proposed to be required for Z-disc maintenance during the mechanical challenges of flight, while being dispensable for the initial morphogenesis during development. This hypothesis is supported by strong genetic evidence and behavioural tests, deleting Zasp's IDR impairs flight from mid-age onwards, while a block in flight activity lifts the phenotype.

      However, some of the phenotypic analysis, in particular the bending of the sarcomere, likely upon mechanical challenge by muscle contractions, needs more detailed investigations to be fully convincing.

      Strengths:

      (1) The linker in the alternatively spliced exon 15 of Zasp52 was deleted with a state-of-the-art genetic editing strategy. Surprisingly, flies are homozygous viable, showing that this long part of the Zasp52 protein is not essential for animal survival or sarcomere morphogenesis.

      (2) The observed sarcomere phenotypes with age, especially the bending Z-discs, are new and exciting.

      (3) The displayed EM images document interesting phenotypes.

      (4) Most of the observed phenotypes can be rescued by re-expression of the long Zasp52 isoform, which does contain the IDR region, but not by a shorter one without it, suggesting that IDR is important.

      (5) FRAP data measure the local turnover of a short-ZaspGFP and show that this increased in the Zasp mutant lacking the IDR domain, suggesting that Zasp-IDR might stabilise Zasp at the Z-disc.

      (6) Interestingly, flight and sarcomere morphology phenotypes can be rescued by preventing the flies from flying, suggesting that they are mechanically induced.

      Weaknesses:

      (1) The western blot quantifications of Zasp isoform expression are weak. No error bars are indicated in the quantifications; the quantifications appear to be more qualitative than quantitative. According to band intensities, the long Zasp isoforms seem to be less present compared to the shorter ones, even in the flight muscles.

      We will work on including quantifications with error bars for the Western blots in our resubmission. It is important to keep in mind that the main point in figure 1B is that there are plenty of exon15e-containing isoforms in IFM, in contrast to other tissues with very limited exon15e-containing isoforms. This is confirmed by the analysis of RNAseq data in figure 1C, and of course, by the flightless phenotype of the exon15e mutant.

      (2) The phenotypic analysis of the sarcomere appears somewhat superficial throughout the paper. Only Zasp52 and phalloidin are shown; no other Z-disc or thick filament proteins. At least myosin stainings and overview images are important to better judge the phenotypic variations. Are the variants between individuals or regional in the same muscle?

      Our images are representative of the observed phenotypes. We aim to provide overview images and other stainings to better illustrate the phenotypic variations in the revised version. Phenotypes are consistently present across all individuals, as reflected in our replicates. Interestingly, they appear not to be randomly interspersed among the sarcomeres but are concentrated in certain regions of the muscle more than others.

      (3) EM images would benefit from better quantification.

      We do not believe that EM images can be meaningfully quantified, because of the many selection steps preceding image acquisition.

      (4) Other proteins were not analysed with the FRAP-based turnover assay for comparison in wild type and mutant. All Z-proteins might turn over faster in the mutant with the defective Z-disc.

      This is the point we are trying to make. The Zasp52 IDR acts like a glue stabilizing all Z-disc proteins. We performed this experiment as a first step to explore whether an exon15e-lacking system exhibited modified dynamics, and we aim to provide more data in the revised version.
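
      As background on what a FRAP turnover comparison typically quantifies, a minimal sketch follows. It assumes a single-exponential recovery model, which is a common choice but is not stated to be the authors' analysis; the function and parameter names are introduced here for illustration.

```python
import math


def frap_recovery(t, f0, f_inf, k):
    """Fluorescence at time t after bleaching, single-exponential model.

    f0: intensity immediately after bleaching
    f_inf: plateau intensity after recovery
    k: recovery rate constant (1/s)
    """
    return f0 + (f_inf - f0) * (1.0 - math.exp(-k * t))


def half_time(k):
    """Time at which half of the recoverable fluorescence has returned."""
    return math.log(2.0) / k


def mobile_fraction(f_pre, f0, f_inf):
    """Fraction of the bleached signal that recovers (f_pre: pre-bleach intensity)."""
    return (f_inf - f0) / (f_pre - f0)
```

      Under this model, faster turnover in a mutant would appear as a larger k (shorter half-time), a larger mobile fraction, or both, which is why comparing additional Z-disc proteins in wild type and mutant would help decide whether the effect is specific to Zasp52.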

      Reviewer #2 (Public review):

      Summary and Strengths:

      This in-depth genetic analysis of Zasp52 function in Drosophila indirect flight muscle (IFM) provides an interesting perspective regarding the role of a partially disordered region (IDR) in exon 15e. This exon seems to be exclusively present in IFM and contributes to the prevention of myofibril disintegration during aging, likely due to interactions of this region with Z-disc insertion and/or stability. The addition of an isoform (PR) that lacks exon 15e serves as a nice control to illustrate the necessity of exon 15e in muscle structure and function. Overall, the manuscript is exceptionally well-written, logical, with nicely controlled experiments and detailed statistical analysis that largely support the conclusions drawn by the authors. While exon 15e is clearly involved in preventing muscle degeneration, a solid role for thin filament stability is not clearly shown (as mentioned in the abstract). In addition, which regions/how the proteins of the IDR may contribute are unclear.

      Weaknesses:

      (1) It is not clear in Figure S1A where exon 15e fits within the Zasp52 locus schematic. This is important as a premise of this paper describes this region to be key, and proof from multiple prediction programs would lend more weight to the prediction of the exon being largely disordered. Inclusion of the discussed short linear motifs, comparison with Canoe or LDB3 for similarities and/or an Alphafold structure would help make the authors' point (colorized with known domains).

      We will add a bar below figure S2A to show the region corresponding to exon 15e. We used three disorder prediction programs and one structure (order) prediction program. The majority of exon15e is completely disordered and has a very low confidence score, and is thus uninformative to display as an Alphafold structure. Likewise, IDRs are very difficult to classify; therefore, we cannot say much more than that LDB3, Zasp52, and Canoe contain IDRs, with Zasp52 and Canoe both having an actin-binding domain within the IDR. We will provide more data on the function of the ABD in the revised version.

      (2) Interesting that immobilization rescues the deterioration phenotypes. The authors should explain in more detail how this was done to avoid dehydration/starvation of the flies.

      We will provide more details in the revised version.

      (3) There is a lot of discussion about the potential function of the IDR region, specifically a putative actin binding motif or other 'ordered' regions that may contain short linear motifs. It would strengthen the findings to show which of these may be essential for Zasp52 function in the IFM. The ability to bind actin could be tested biochemically, and/or smaller deletions could be made to unequivocally test the role of the ABD vs other predicted motifs using genetics. If some of these regions are more ordered, where do they lie within, and do they form a predicted fold or structure that gives insight into function?

      We will provide data on the function of the ABD in the revised version.

    1. eLife Assessment

      This important study identified Mex3a protein with dual RNA-binding protein/ubiquitin ligase function as a pivotal regulator of olfactory sensory neurons (OSN) differentiation and lineage fidelity. The authors employed a combination of systems biology approaches (e.g., single-cell RNA sequencing, proteomics) and newly developed animal models (e.g., HyperTRIBE) to provide solid evidence that abrogation of Mex3a disrupts cilia structure and polarity of OSNs. Notwithstanding that this article is of a broad potential interest across different biomedical disciplines ranging from RNA to developmental biology, additional mechanistic data connecting identified Mex3a mRNA targets and ensuing OSN phenotypes would further strengthen this study.

    2. Reviewer #1 (Public review):

      The study by Escamilla del Arenal et al. utilized a conditional knockout mouse model to study the role of Mex3a in immature olfactory sensory neurons (OSN). Mex3a is a dual-functional protein that has RNA-binding function and ubiquitin-E3 ligase activity. The results revealed that Mex3a expression is critical for proper OSN differentiation and contributes to cell surface protein trafficking and translation, cilia structure, and planar cell polarity in mature neurons. Moreover, Mex3a enforces lineage fidelity, selectively repressing sustentacular programs in neurons and neuronal programs in sustentacular cells.

      In addition, the authors established an in vivo HyperTRIBE mouse model to identify Mex3a RNA targets and incorporated UbiFast into the Mex3a conditional knockout (cKO) model to find its protein targets to investigate how Mex3a regulates OSN differentiation. The experimental systems are laborious and comprehensive, which allowed the authors to identify new Mex3a putative targets in OSN.

      The phenotypic results derived from the conditional Mex3a cKO mice are solid. Mechanistic findings also revealed that, in addition to facilitating protein degradation, Mex3a may confer K27 ubiquitin linkage on its target proteins, which has a non-proteolytic role but affects target protein activity, other post-translational modifications, or protein-protein interactions. However, among all Mex3a putative targets, the authors decided to emphasize the Mex3a-mediated K27 ubiquitination of the stress granule protein Serbp1 and the ribosomal protein Rps7, and the association between Mex3a expression and Serbp1 and p-eEF2 ribosome recruitment. This Mex3a-Serbp1-p-eEF2 ribosome recruitment axis, although it can be important in Unfolded Protein Response (UPR) signaling, seems rather general and cannot explain the striking lineage-specific phenotypes observed in the mouse model. The authors need to provide more solid evidence to demonstrate that K27-ubiquitination of Serbp1 is a key step of Mex3a function in OSN differentiation to strengthen the relation between the phenotypes and mechanism presented in this study.

    3. Reviewer #2 (Public review):

      Summary:

      In this manuscript, Arenal and colleagues demonstrate that loss of Mex3a leads to defects in cell surface protein trafficking, translation, ciliary structure, and planar cell polarity in mature neurons. Through proteomic analyses, the authors show that Mex3a depletion alters the abundance of proteins involved in vesicular transport, lipid metabolism, and ribosome biogenesis. Using the HyperTRIBE approach, the authors further identify targets of Mex3a and provide evidence supporting a role for K27-linked ubiquitination in regulating these substrates. Mechanistically, the study suggests that Mex3a levels influence the recruitment of SERBP1 and phosphorylated eEF2 (p-eEF2) to ribosomes, contributing to translational repression.

      Strengths:

      Overall, this is a very interesting and well-written manuscript that significantly advances our understanding of Mex3a function and its role in neuronal development, particularly in olfactory sensory neurons. The data are clearly presented and thoughtfully interpreted.

      Weaknesses:

      I have a few minor comments that may further strengthen the manuscript and improve its accessibility to a broader readership.

      (1) In Figure 3B, the authors describe Mex3a localization to cytoplasmic granules. However, it is unclear how these compartments were defined. It would strengthen the conclusions if the authors included co-localization experiments using established cytoplasmic granule markers (e.g., stress granule markers) to define the identity of these structures more precisely. This would clarify whether Mex3a associates with stress granules, RNA processing bodies, or another class of ribonucleoprotein granules.

      (2) Functional validation of K27-linked ubiquitination on SERBP1. To further define the functional significance of K27-linked ubiquitination, it would be informative to mutate the relevant lysine residue(s) on SERBP1 and examine whether this alters its recruitment to ribosomes or affects translational repression. Such an experiment would provide more direct evidence that K27-linked ubiquitination of SERBP1 mediates the observed translational effects.

      (3) Discussion of vesicular trafficking and lipid metabolism targets. The identification of Mex3a targets involved in vesicular trafficking and lipid metabolism, including COPII coat components such as Sec31a and lipid regulatory proteins such as Sec14 and PIP5K1A, is particularly intriguing. The authors may wish to expand the Discussion to address how regulation of these proteins could contribute to defects in plasma membrane trafficking and planar cell polarity. Integrating these findings with the observed cell surface trafficking phenotypes would further enhance the mechanistic framework of the study.

    4. Reviewer #3 (Public review):

      Summary:

      In this manuscript, the authors investigate the role of the KH and RING domain-containing protein Mex3a in the differentiation and maturation of olfactory sensory neurons. Using conditional knockout of Mex3a in immature neurons, they show that mature olfactory sensory neurons display defects in membrane protein trafficking, including olfactory receptors and Adcy3, together with abnormalities in ciliary radial organization and planar cell polarity. Through single-cell RNA sequencing and quantitative proteomics, the authors further show that Mex3a-deficient neurons fail to properly resolve the unfolded protein response and exhibit transcriptomic features suggestive of lineage mixing with sustentacular cells. The study also introduces a methodological advance by adapting HyperTRIBE for use in transgenic mice, which enables the identification of in vivo Mex3a RNA targets, including components of Wnt signaling that appear to be under translational repression by Mex3a. The authors then pursue one of these targets to further explore the role of Mex3a in translational repression.

      Strengths:

      First, it addresses an important biological and conceptual question. Mex3a is a multifunctional protein with the potential to couple RNA regulation, protein homeostasis, and key cellular processes, yet its in vivo role in neuronal differentiation remains poorly understood. By focusing on Mex3a in olfactory sensory neurons, the manuscript asks a timely and important question of how post-transcriptional regulation contributes to the maturation of highly specialized neurons, including the establishment of ciliary architecture, membrane protein trafficking, and cell polarity. Second, the generation and validation of an inducible in vivo mouse HyperTRIBE system represents a technical advance. By incorporating the Adar deaminase domain into a transgenic mouse model, the authors establish a rigorous and useful approach for identifying Mex3a RNA targets in vivo, which is likely to be valuable to the wider RNA biology community. Third, the study integrates the Mex3a knockout model with single-cell RNA sequencing, quantitative mass spectrometry-based proteomics, ubiquitin profiling, and ribosome-related analyses, providing a broad and multilayered view of the Mex3a knockout phenotype. Finally, the imaging analyses revealing altered ciliary content and organization in olfactory sensory neurons identify an interesting and potentially important link between Mex3a, cilia biology, and vesicular trafficking. More broadly, the manuscript reflects a very substantial experimental effort, and each individual dataset has the potential to be useful for the field.

      Weaknesses:

      A main weakness of the manuscript is that the mechanistic links between the major findings remain somewhat correlative, and the biological narrative is not fully sustained through the later figures. The study documents defects in membrane trafficking, ciliary radial organization, and planar cell polarity, and it identifies candidate targets with clear relevance to these processes, including factors linked to vesicle trafficking. However, the manuscript then shifts its mechanistic focus toward translational regulators such as Serbp1 and Rps7, without adequately connecting these later analyses back to the core phenotypes established earlier. As a result, there is a noticeable disconnect between the phenotypic emphasis of the study and the mechanistic validation that follows.

      A second weakness is that, given the breadth and potential importance of the datasets generated, validation remains limited for several of the major conclusions. This reduces confidence in the interpretation of the single-cell, proteomic, ubiquitin-related, and ribosome-associated analyses, and also limits the future value of these datasets as a resource for the field. Because the manuscript aims to address several major questions at once, stronger validation and clearer integration across the different experimental arms are needed for the conclusions to feel fully supported.

      Finally, the HEK293T overexpression experiments are less solid than the in vivo analyses and do not provide equally strong support for the proposed mechanisms. In this context, some of the observed effects on cytoskeletal organization, membrane-less granule formation, and ribosome profiles may be indirect, which makes it difficult to weigh these findings alongside the much stronger in vivo phenotypes.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      The authors report the results of a tDCS brain stimulation study (verum vs sham stimulation of left DLPFC; between-subjects) in 46 participants, using an intense stimulation protocol over 2 weeks, combined with an experience-sampling approach, plus follow-up measures after 6 months.

      Strengths:

      The authors are studying a relevant and interesting research question using an intriguing design, following participants quite intensely over time and even at a follow-up time point. The use of an experience-sampling approach is another strength of the work.

      Weaknesses:

      There are quite a few weaknesses, some related to the actual study and some more strongly related to the reporting about the study in the manuscript. The concerns are listed roughly in the order in which they appear in the manuscript.

      We truly appreciate your dedicating time and efforts to review our manuscript. Yes, we do perceive that those weaknesses you raised all make sense. We agree with you on almost all the suggestions that you detailed below, particularly in clarifying statistics and sample size determination. Please see specific responses below.

      Major Comments

      (1) In the introduction, the authors present procrastination nearly as if it were the most relevant and problematic issue there is in psychology. Surely, procrastination is a relevant and study-worthy topic, but that is also true if it is presented in more modest (and appropriate) terms. The manuscript mentions that procrastination is a main cause of psychopathology and bodily disease. These claims could possibly be described as 'sensationalized'. Also, the studies to support these claims seem to report associations, not causal mechanisms, as is implied in the manuscript.

      Thank you for this very practical suggestion. We agree that the current statements to underline the importance of procrastination are somewhat overreaching. Upon revision, we have overall toned down such claims by explicitly stating them as “associative evidence”, and rewritten a portion of terms in a more modest and balanced style. Please see specific revisions in the main text below:

      Introduction Section (Page 5, Line 64-81)

      “Procrastination is increasingly becoming a prevalent behavioral problem around the world, which reflects the irrational voluntary postponement of scheduled tasks albeit being worse off for such delays (Blake, 2019; Steel, 2007). In the epidemiological investigations, more than 15% of adults were identified as having chronic procrastination problems, and the situation for students was worse as 70-80% of undergraduates engaged in procrastination (American College Health Association, 2022; Ferrari et al., 2005). Moreover, the behavioral genetic evidence indicates a certain heritability of procrastination in human beings as well (Gustavson et al., 2017; Gustavson et al., 2014, 2015). In addition to its prevalence, the undesirable associations between procrastination behavior and health also warrant cautions. There is cumulative evidence to show the close associations between procrastination behavior and working performance, financial status, interpersonal relationships, and subjective well-being (Ferrari, 1994; Pychyl & Sirois, 2016; Steel et al., 2021). Further, as the prospective cohort studies indicated, many mental health problems emerge alongside procrastination, particularly in sleep problems, depression, and anxiety (Hairston & Shpitalni, 2016; Johansson et al., 2023). Even worse, chronic procrastination behavior has been observed to impair general health, as manifested by the intimate associations with close system disruption, gastrointestinal disturbance, as well as a high risk of hypertension and cardiovascular disease (Sirois, 2015; Sirois, 2016). ... ”

      (2) It is laudable that the study was pre-registered; however, the cited OSF repository cannot be accessed and therefore, the OSF materials cannot be used to (a) check the preregistration or to (b) fill in the gaps and uncertainties about the exact analyses the authors conducted (this is important because the description of the analyses is insufficiently detailed and it is often unclear how they analyzed the data).

      We are sorry to encounter a serious technical barrier making our preregistration invisible and inaccessible. The OSF has disabled my OSF account, as it claimed to detect “suspicious user’s activities” in my account (please see the screenshot below). This results in no access to all materials already deposited in this OSF account, including this preregistration. We have contacted the OSF team, but received no valid technical solution to recover this preregistered report. We reckon that this may be triggered by my affiliation change to the Third Military Medical University of the People’s Liberation Army (PLA).

      To address this unexpected circumstance and to ensure transparency, we have explicitly reported this case in the main text, and added the “Reconstructed Preregistration Statement” into the Supplemental Materials (SM). Also, as it has been out of best practices in preregistration, in addition to transparently reporting this case, we have removed this statement regarding preregistration elsewhere throughout the whole revised manuscript. Furthermore, we fully understand the gaps of comprehending the statistics of this study, resulting from inadequate methodological details in the reporting. Therefore, we have clearly reported extensive details in the Methods section to clarify how to conduct those analyses, favoring the smooth evaluations of our conclusions. Please see what we have added in the lines below (Comments #4-9).

      Methods Section (Page 5, Line 186-191)

      “This study fully adhered to CONSORT reporting guidelines, and was originally preregistered in the OSF repository (10.17605/OSF.IO/Y3EDT). However, due to a technical constraint related to the OSF account service (see SM), this OSF page is no longer accessible. For transparency and in keeping with open-science best practices, a preregistration statement has been reconstructed from the original protocol documentation to clarify the a priori hypotheses, sample size determination, and analysis plans for this study (Table S1).”

      (3) Related to the previous point: I find it impossible to check the analyses with respect to their appropriateness because too little detail and/or explanation is given. Therefore, I find it impossible to evaluate whether the conclusions are valid and warranted.

      Again, we apologize for the confusion caused by inadequate statistical and methodological details. This manuscript was previously reviewed by Nature Human Behaviour, which editorially constrained the paper length, so a substantial number of details had to be omitted. As you kindly suggested, we have added extensive descriptions to clarify how we carried out the statistical analyses in the present study. Please see specific instances below.

      (4) Why is a medium effect size chosen for the a priori power analysis? Is it reasonable to assume a medium effect size? This should be discussed/motivated. Related: 18 participants for a medium effect size in a between-subjects design strikes me as implausibly low; even for a within-subjects design, it would appear low (but perhaps I am just not fully understanding the details of the power analysis).

      Thank you for raising this crucial question. We have determined this a priori effect size based on the existing work we published previously (Xu et al., 2023, J Exp Psychol Gen;152(4):1122-1133). In our pilot study (Xu et al., 2023), we identified a significant interaction effect between the single-session tDCS stimulation (active vs sham) and time (pre-test vs post-test) (t = 2.38, p = .02, n = 27; 95% CI [0.14, 1.49]) for changing procrastination willingness in the laboratory settings, indicating a medium effect size. Therefore, this pilot study provides supportive evidence to determine this effect size a priori. To clarify, we have explicitly justified the selection of this effect size in the Methods section.

      Methods Section (Page 5, Line 206-215)

      “A full randomized block design was used to assign participants to both groups (active neuromodulation group, NM; sham-control group, SC) (see Fig. 2C). As the pilot study probing into the effect of single-session tDCS stimulation to change procrastination willingness indicated (t = 2.38, p = .02, 95% CI [0.14, 1.49]; Xu et al., 2023), statistical power was predetermined by G*Power at a relatively medium effect size (1-β err prob = 0.80, f = 0.25), yielding the total sample size at 18 to reach acceptable power (see SM Methods and Fig. S1)....”

      We fully understand that this sample size for a medium effect size seems low, and that the 18 participants for each group are limited in any case. Upon double-checking these power analyses, we confirmed that this sample size requirement is indeed correct. Please see the G*Power outputs in Author response image 1.

      Author response image 1.

      Despite the absence of algorithmic errors in the power analysis here, we are aware that this limited sample size may hamper statistical robustness. To tackle this weakness, we have clearly warranted such cautions in the Limitation section:

      Limitations Section (Page 12, Line 637-640)

      “... In addition to technical limitations, given the apparently limited size of the sample (total N = 46), it warrants caution in generalizing these findings elsewhere, and necessitates further validations in a large-scale cohort.”

      (5) It remains somewhat ambiguous whether the sham group had the same number of stimulation sessions as the verum stimulation group; please clarify: Did both groups come in the same number of times into the lab? I.e., were all procedures identical except whether the stimulation was verum or sham?

      Yes, we fully followed the CONSORT pipeline to carry out this double-blind trial, and thus confirmed that all the participants in both groups had the same number of stimulation sessions in our lab. That is to say, except for the stimulation type (verum vs sham), all the procedures, equipment and even the room were identical for all the participants. For clarification, we have clearly stated this in the main text:

      Results Section (Page 9, Line 419-423)

      “In both groups, almost all participants (93.2%, 41/44) reported perceiving acceptable pain stemming from current stimulation, and believed they were receiving treatment (91.30% (21/23) for active neuromodulation group (NM), 86.95% (20/23) for sham control group (SC), χ<sup>2</sup> = 0.224, p = .636). All participants underwent identical experimental procedures except for the stimulation type (active vs sham). ...”
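
As a rough check of the blinding comparison quoted above, the sketch below recomputes a 2×2 test from the reported counts (21/23 vs 20/23 believing they received treatment). It assumes the reported statistic is an uncorrected Pearson χ² with 1 degree of freedom; the function name `chi2_2x2` is hypothetical and is not taken from the authors' analysis code.

```python
import math

def chi2_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 degree of freedom, the upper-tail probability equals erfc(sqrt(chi2 / 2)).
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p

# Rows: NM, SC. Columns: believed treatment was real, did not.
chi2, p = chi2_2x2(21, 2, 20, 3)
print(f"chi2 = {chi2:.3f}, p = {p:.3f}")  # chi2 = 0.224, p = 0.636
```

Under that assumption, the counts reproduce the reported values. With expected cell counts this small, a Fisher exact test would be a reasonable alternative.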

      (6) The TDM analysis and hyperbolic discounting approach were unclear to me; this needs to be described in more detail, otherwise it cannot be evaluated.

      We apologize for the inadequate details, which hindered a precise understanding of the TDM and the hyperbolic discounting model. The Temporal Decision Model (TDM) was originally proposed by our team (Xu et al., 2023; Zhang et al., 2019, 2020, 2021). It conceptualizes procrastination as a failed trade-off between task outcome value (i.e., the motivation to act now in pursuit of the task reward) and task aversiveness (i.e., the motivation to avoid acting now in order to escape negative experiences). Once task aversiveness overrides the pursuit of task outcome value, procrastination emerges. One overarching hypothesis of this model is that task aversiveness is hyperbolically discounted as the deadline approaches: it is discounted sharply when far from the deadline but slowly when nearing the deadline (Zhang et al., 2019). Considering the nonlinear dynamics inherent in this hyperbolic discounting, we therefore employed a log-spaced temporal sampling scheme (Myerson et al., 2001) to strengthen curve-fitting performance (please see the schematic diagram at https://uen.pressbooks.pub/behavioraleconomics/chapter/the-reality-of-homo-sapiens, where each point indicates a sampling time):

      Specifically, based on the log-spaced temporal sampling rule, five time points were first selected to fulfill the statistical prerequisites for hyperbolic model fitting, with increasing sampling density toward the deadline (e.g., for a task due at 20:00, sampling occurred at 10:00, 16:00, 18:00, 19:30, and 20:00). At each time point, participants reported task aversiveness (A) on a 0–100 Visual Analog Scale (VAS). Task aversiveness discounting was then calculated as 1 - (A<sub>t</sub> / A<sub>earliest</sub>), where A<sub>earliest</sub> is the rating at the earliest sampling point (e.g., 10:00), which serves as the reference for immediate execution. Subsequently, using GraphPad Prism software (v9, 525), we estimated the AUC from these five data points based on the Myerson algorithm (Myerson et al., 2001), computed as the trapezoidal integration of task aversiveness discounting over time. Under this method, a higher AUC reflects stronger temporal discounting of task aversiveness: participants experience a faster decline in subjective aversiveness as execution is delayed, yielding lower effective aversiveness and reduced avoidance behavior. That is to say, a participant with a higher AUC experiences a more pronounced reduction in subjective aversiveness upon postponement, plausibly yielding less procrastination. As you kindly suggested, we have added these details to clarify how the hyperbolic discounting approach was used to determine the sampling time points and to calculate the AUC of task aversiveness discounting.
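
The calculation described above can be sketched as follows. This is a minimal illustration, not the authors' Prism workflow: the ratings are made up, the function names are hypothetical, and rescaling time to [0, 1] before integrating is an assumption borrowed from the usual Myerson-style AUC convention (the authors' own time scaling is not stated).

```python
def discounting(ratings):
    """Return 1 - A_t / A_earliest for each rating; the first rating is the reference."""
    reference = ratings[0]
    if reference <= 0:
        raise ValueError("earliest rating must be positive")
    return [1 - a / reference for a in ratings]

def auc_trapezoid(times, values):
    """Trapezoidal area under `values`, with `times` rescaled to the interval [0, 1]."""
    span = times[-1] - times[0]
    x = [(t - times[0]) / span for t in times]
    return sum((x[i + 1] - x[i]) * (values[i] + values[i + 1]) / 2
               for i in range(len(x) - 1))

# Sampling moments from the example in the text, in clock hours (deadline at 20:00).
hours = [10.0, 16.0, 18.0, 19.5, 20.0]
# Hypothetical VAS aversiveness ratings (0-100) at those moments.
ratings = [80, 60, 40, 20, 10]

d = discounting(ratings)
auc = auc_trapezoid(hours, d)
print(d)              # [0.0, 0.25, 0.5, 0.75, 0.875]
print(round(auc, 4))  # 0.2844
```

A steeper drop in the ratings raises the discounting values and therefore the AUC, which matches the interpretation given in the text.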

      Methods Section (Page 6, Line 268-283)

      “On the Task day, we developed a mobile app to implement the experience sampling method (ESM) for tracking one’s real-time evaluation of task aversiveness and task outcome value (see Fig. 1). Task aversiveness describes how disagreeable one perceives performing a given real-life task to be, whereas outcome value refers to the subjective benefits brought about by completing the task before the deadline (Zhang & Feng, 2020). As conceptualized by the temporal decision model (TDM) of procrastination, perceived task aversiveness is hyperbolically discounted as the deadline approaches, discounting sharply when far from the deadline but slowly once nearing it (Zhang & Feng, 2020; Zhang et al., 2021). Thus, considering the nonlinear dynamics inherent in this hyperbolic discounting, the five ESM recording moments were selected per task a priori using a log-spaced temporal sampling scheme (Myerson et al., 2001), with increasing sampling density toward the deadline, such as 10:00 (earliest), 16:00, 18:00, 19:30, and 20:00 (deadline). Five sampling points meet the statistical prerequisite for hyperbolic model fitting, which requires ≥ 4 points (Green & Myerson, 2004). Accordingly, recording moments were individually tailored for each task per participant in this ESM procedure.”

      Methods Section (Page 7, Line 318-334)

      “... As articulated in the temporal decision model above, the task aversiveness evoked by executing a task is temporally dynamic and follows a hyperbolic discounting pattern, discounting sharply when far from the deadline but slowly when nearing it (Zhang & Feng, 2020). To quantitatively characterize task aversiveness while accounting for its dynamics, the model-free area under the curve (AUC) was calculated. Specifically, based on the log-spaced temporal sampling rule, task aversiveness was measured with a 100-point visual analog scale at the five sampling moments. Then, task aversiveness discounting was calculated as 1 - (A(t) / A(earliest)), where A(earliest) is the rating at the earliest sampling point, which serves as the reference for immediate execution. Subsequently, using GraphPad Prism software (v9, 525), the AUC was computed as the trapezoidal integration of task aversiveness discounting over time across the five data points, based on the Myerson algorithm (Myerson et al., 2001). Thus, a higher AUC reflects stronger temporal discounting of task aversiveness as the deadline nears, which means that participants experience a faster decline in subjective aversiveness as execution is delayed, yielding lower effective aversiveness and reduced avoidance behavior. Task outcome value, in contrast, was theoretically posited as a relatively stable evaluation of the task (Zhang & Feng, 2020; Zhang et al., 2021).”

      References

      Myerson, J., Green, L., & Warusawitharana, M. (2001). Area under the curve as a measure of discounting. Journal of the experimental analysis of behavior, 76(2), 235–243. https://doi.org/10.1901/jeab.2001.76-235

      Xu, T., Zhang, S., Zhou, F., & Feng, T. (2023). Stimulation of left dorsolateral prefrontal cortex enhances willingness for task completion by amplifying task outcome value. Journal of experimental psychology. General, 152(4), 1122–1133. https://doi.org/10.1037/xge0001312

      Zhang, S., Verguts, T., Zhang, C., Feng, P., Chen, Q., & Feng, T. (2021). Outcome Value and Task Aversiveness Impact Task Procrastination through Separate Neural Pathways. Cerebral cortex (New York, N.Y. : 1991), 31(8), 3846–3855. https://doi.org/10.1093/cercor/bhab053

      Zhang, S., Liu, P., & Feng, T. (2019). To do it now or later: The cognitive mechanisms and neural substrates underlying procrastination. Wiley interdisciplinary reviews. Cognitive science, 10(4), e1492. https://doi.org/10.1002/wcs.1492

      Zhang, S., & Feng, T. (2020). Modeling procrastination: Asymmetric decisions to act between the present and the future. Journal of experimental psychology. General, 149(2), 311–322. https://doi.org/10.1037/xge0000643

      (7) Coming back to the point about the statistical analyses not being described in enough detail: One important example of this is the inclusion of random slopes in their mixed-effects model, which is unclear. This is highly relevant, as omission of random slopes has repeatedly been shown to lead to extremely inflated Type 1 errors (e.g., inflating Type 1 errors by a factor of ten, so that a significant p value of .05 might be obtained when the true p value is .5). Thus, if indeed random slopes have been omitted, then it is possible that significant effects are significant only due to inflated Type 1 error. Without more information about the models, this cannot be ruled out.

      Thank you for sharing this very timely and crucial comment. After careful scrutiny, we identified the statistical flaw you pointed out: participants had been modeled with random intercepts only, without random slopes. As you kindly suggested, we have reanalyzed all the statistics by adding random slopes (i.e., (1 + day|SubjectID)). Results showed a statistically significant interaction effect for both procrastination willingness (β = -7.8, SE = 1.8, DF = 45.6, p < .001) and actual procrastination rates (β = -7.4, SE = 2.4, DF = 46.6, p = .004), indicating the effectiveness of multi-session neuromodulation in mitigating procrastination. In the post-hoc simple effect analyses, participants who engaged in active neuromodulation (NM) showed a significant increase in task-execution willingness (i.e., decreased procrastination willingness; NM-before: 35.65 ± 30.20, NM-after: 80.43 ± 19.92, t.ratio = 5.4, p < .0001, Tukey correction) and a decrease in actual procrastination rates (NM-before: 43.26 ± 39.09, NM-after: 0.00 ± 0.00, t.ratio = 5.1, p < .0001, Tukey correction), while no such effects were identified for participants in the sham control group (for willingness, SC-before: 37.57 ± 26.46, SC-after: 47.35 ± 30.49, t.ratio = 0.3, p = .77, Tukey correction; for actual procrastination, SC-before: 46.47 ± 40.75, SC-after: 33.34 ± 37.82, t.ratio = 0.7, p = .48, Tukey correction). Taken together, we appreciate your pointing out this crucial statistical weakness, and have confirmed that our findings remain reliable after adding random slopes to guard against inflated Type 1 error. Moreover, as you kindly suggested, we have incorporated these statistical details, particularly those concerning the GLMM, into the main text to facilitate your evaluation. Please see specific revisions below:

      Methods Section (Page 8, Line 381-401)

      “To clarify whether multiple-session HD-tDCS neuromodulation can reduce procrastination, a generalized linear mixed-effects model (GLMM) was constructed with a full factorial design for subjective procrastination willingness (i.e., self-reported visual analog scores) and actual procrastination behavior (i.e., real-world task-completion rate before deadline). Here, sex, age and socioeconomic status (SES) were modeled as covariates of no interest. Following the classification issued by the National Bureau of Statistics of China (https://www.stats.gov.cn/sj/tjbz/gjtjbz/), SES was divided into seven hierarchical tiers, from 1 (poor) to 7 (rich), on the basis of per capita annual household income. To obviate subjective rating bias stemming from individual daily mood, we separately measured participants’ daily emotional fluctuation at 10:00 and 16:00 using a self-rating visual analog item (i.e., “How do you feel for your mood today?”, 0 for “completely uncomfortable” and 100 for “definitely happy”). The averaged score of these self-rated emotions at the two time points was then entered into the GLMM as a covariate of no interest, yielding the final expression of “outcome ~ Group*Treatment_Day + Age + Gender + SES + Emotions + (1 + Treatment_Day | SubjectID)” in the statistical model. This analysis was implemented using the “lme4” and “lmerTest” packages. Employing the “emmeans” package, simple effects were also tested at baseline and post-last-intervention using Tukey-adjusted pairwise comparisons of estimated marginal means from the full GLMM, controlling for covariates and random-effects structure. To validate statistical robustness, instead of continuous outcomes for parametric tests, we also conducted a between-group comparison of the number of tasks on which procrastination emerged by using the nonparametric χ<sup>2</sup> test with φ correction or Fisher exact test....”
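
One way to read the model expression quoted above is as the following equation for observation j of participant i. This is an interpretive sketch that assumes a Gaussian response with an identity link (as fitted by `lmer`); the symbols are introduced here for illustration and do not appear in the manuscript.

```latex
\begin{aligned}
y_{ij} &= \beta_0 + \beta_1\,\mathrm{Group}_i + \beta_2\,\mathrm{Day}_{ij}
        + \beta_3\,(\mathrm{Group}_i \times \mathrm{Day}_{ij})
        + \boldsymbol{\gamma}^{\top}\mathbf{c}_{ij}
        + u_{0i} + u_{1i}\,\mathrm{Day}_{ij} + \varepsilon_{ij},\\
(u_{0i},\, u_{1i})^{\top} &\sim \mathcal{N}(\mathbf{0},\, \boldsymbol{\Sigma}),\qquad
\varepsilon_{ij} \sim \mathcal{N}(0,\, \sigma^{2})
\end{aligned}
```

Here c collects the covariates (Age, Gender, SES, Emotions), β3 is the Group × Treatment_Day interaction of interest, u0i is the random intercept, and u1i is the by-participant random slope for Treatment_Day. Dropping u1i gives the intercept-only model that the reviewer warned about.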

      Results Section (Page 9, Line 428-449)

      “To identify whether ms-tDCS targeting the left DLPFC can alleviate subjective procrastination willingness and actual procrastination behavior, a generalized linear mixed-effects model with the Satterthwaite method was built, with task-execution willingness and actual procrastination rates (PR) as primary outcomes, respectively. For procrastination willingness, results showed a statistically significant interaction effect between multi-session neuromodulations and groups (β = -7.8, SE = 1.8, DF = 45.6, p < .001; Fig. 3A). In the post-hoc simple effect analysis, it demonstrated a significantly increased task-execution willingness (i.e., decreased procrastination willingness) after neuromodulation in the active neuromodulation group (NM-before: 35.65 ± 30.20, NM-after: 80.43 ± 19.92, t.ratio = 5.4, p < .0001, Tukey correction), but no such effects were identified in the sham control group (SC-before: 37.57 ± 26.46, SC-after: 47.35 ± 30.49, t.ratio = 0.3, p = .77, Tukey correction) (Fig. 3B-C). A linear uptrend for task-execution willingness was further observed across multiple sessions in the active NM group, indicating gradually increasing neuromodulation effects (Fig. 3D; p < .01, Mann-Kendall test). For actual procrastination behavior, changes to actual procrastination rates across all the sessions have been detailed in the Fig. 3E. Similarly, a statistically significant interaction effect was identified here (β = -7.4, SE = 2.4, DF = 46.6, p = .004), and the simple effect analysis further revealed decreased actual procrastination rates after ms-tDCS in the active neuromodulation group (NM-before: 43.26 ± 39.09, NM-after: 0.00 ± 0.00, t.ratio = 5.1, p < .0001, Tukey correction), but no such prominent changes found in the sham control group (SC-before: 46.47 ± 40.75, SC-after: 33.34 ± 37.82, t.ratio = 0.7, p = .48, Tukey correction) (Fig. 3F-G). Also, a significant downtrend for procrastination rates across all the sessions was identified in the active NM group (Fig. 3H; p < .01, Mann-Kendall test).”
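
For readers unfamiliar with the Mann-Kendall test mentioned above, the following is a minimal sketch of the standard two-sided version using the normal approximation. It assumes no tied values, and the seven session means are hypothetical, chosen only to illustrate a monotonic downtrend. It is not the authors' implementation.

```python
import math

def mann_kendall(series):
    """Two-sided Mann-Kendall trend test (normal approximation, assumes no ties)."""
    n = len(series)
    # S = number of increasing pairs minus number of decreasing pairs over all i < j.
    s = sum((series[j] > series[i]) - (series[j] < series[i])
            for i in range(n - 1) for j in range(i + 1, n))
    var_s = n * (n - 1) * (2 * n + 5) / 18
    # Continuity correction: shrink |S| by 1 before standardizing.
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return s, z, p

# Hypothetical mean procrastination rates (%) across seven sessions.
rates = [43.0, 38.5, 30.2, 24.9, 15.4, 9.8, 3.1]
s, z, p = mann_kendall(rates)
print(s, round(z, 2), p < 0.01)  # -21 -3.0 True
```

A negative S with a small p-value indicates a monotonic downtrend. With so few sessions the normal approximation is only rough, and an exact test may be preferable.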

      (8) Related to the previous point: The authors report, for example, on the first results page, line 420, an F-test as F(1, 269). This means the test has 269 residual degrees of freedom despite a sample size of about 50 participants. This likely suggests that relevant random slopes for this test were omitted, meaning that this statistical test likely suffers from inflated Type 1 error, and the reported p-value < .001 might be severely inflated. If that is the case, each observation was treated as independent instead of accounting for the nestedness of data within participants. The authors should check this carefully for this and all other statistical tests using mixed-effects models.

      Thank you for underlining this very timely and helpful comment. As you correctly pointed out above, we did not include random slopes in the original GLMM, highly risking the inflation of the false-positive rate (i.e., Type-I error). By adding the random slopes, we reanalyzed all the statistics from the GLMM, and confirmed that all the findings are still reliable from those new GLMMs with random slopes. Again, thank you for this crucial statistical advice, and please see the above response for full details regarding what we have revised to address this comment you kindly raised.

      (9) Many of the statistical procedures seem quite complex and hard to follow. If the results are indeed so robust as they are presented to be, would it make sense to use simpler analysis approaches (perhaps in addition to the complex ones) that are easier for the average reader to understand and comprehend?

      We thank you for this practical and helpful comment. In the original manuscript, we incorporated a joint model of longitudinal and survival data (JM-LSD), in conjunction with machine learning algorithms, to strengthen the robustness of our statistical findings. Nevertheless, we agree with you on this point: there is no need to complicate the analyses by repeatedly probing the same research question to increase methodological robustness, at the expense of readability and intelligibility for a broader audience. As you suggested, we have removed these complicated statistical methods and retained only the primary ones (the GLMM and the χ<sup>2</sup> cross-tab test) together with a complementary one (the Mann-Kendall linear trend test). As a result, we have rewritten almost the whole Results section. Please see the specific instances below:

      Results Section (Page 9, Line 468-485)

      “Ms-tDCS changes task aversiveness and task-outcome value

      Both task aversiveness and task outcome value serve as key pathways determining whether one would procrastinate. To this end, we further utilized a generalized linear mixed-effects model to examine the effects of ms-tDCS on changes in task aversiveness and task outcome value. Task aversiveness changes across all the sessions are shown in Fig. 4A and 4C. We demonstrated a statistically significant decrease in task aversiveness and an increase in task outcome value via ms-tDCS in the neuromodulation group (Task aversiveness: interaction effect, β = -0.12, SE = 0.04, DF = 46.7, p = .002; simple effect, NM-before <sub>(AUC)</sub>: 1.13 ± 0.53, NM-after <sub>(AUC)</sub>: 1.95 ± 0.85, t.ratio = 4.5, p < .001, Tukey correction; Outcome value: β = -6.8, SE = 1.74, DF = 46.2, p < .001; simple effect, NM-before: 35.86 ± 27.82, NM-after: 73.08 ± 23.33, t.ratio = 5.0, p < .001, Tukey correction; see Fig. 4B), but not in the sham control group (Task aversiveness: SC-before <sub>(AUC)</sub>: 1.07 ± 0.51, SC-after <sub>(AUC)</sub>: 1.28 ± 0.46, t.ratio = 1.3, p = .20, Tukey correction; Outcome value: SC-before: 34.00 ± 25.17, SC-after: 40.13 ± 28.94, t.ratio = 0.8, p = .41, Tukey correction; see Fig. 4D). In the neuromodulation (NM) group, task aversiveness steadily decreased with the cumulative number of stimulation sessions, while perceived task outcome value increased significantly (see Fig. 4E-F, p < .05, Mann-Kendall test). Thus, these results provide causal evidence that neuromodulation of the left DLPFC simultaneously reduces task aversiveness and enhances task-outcome value.”

      Results Section (Page 10, Line 525-542)

      “Long-term effects of ms-tDCS

      We have also attempted to conduct a follow-up investigation to test the long-term retention of ms-tDCS in reducing actual procrastination. All participants except one in the neuromodulation group completed the follow-up 6 months after the last neuromodulation session (N<sub>NM</sub> = 22, N<sub>SC</sub> = 23). Thus, the GLMM was constructed, with the PR before the first neuromodulation vs. the PR 6 months after the last neuromodulation as covariates of interest. Results showed a statistically significant group*time interaction effect (β = 16.5, SE = 9.9, p = .049). The simple-effect model demonstrated a decrease in actual procrastination rates in the active neuromodulation group 6 months after the last stimulation compared to baseline (β = -22.05, SE = 10.0, p = .038, Tukey correction; NM-before: 40.68 ± 37.96, NM-after<sub>6-months</sub>: 18.63 ± 29.80), and revealed null effects in the SC group (β = 1.26, SE = 9.78, p = .99, Tukey correction; SC-before: 46.47 ± 40.75, SC-after<sub>6-months</sub>: 47.73 ± 39.18) (see Fig. 6). Furthermore, using a nonparametric χ<sup>2</sup> test to compare differences in the number of procrastinated tasks, we still found a statistically significant reduction in procrastination frequency in the NM group 6 months after neuromodulation compared to baseline (χ<sup>2</sup> = 3.30, p = .035, NM-before: 68.19% (15/22), NM-after<sub>6-months</sub>: 40.91% (9/22)), while no significant changes were observed in the SC group (χ<sup>2</sup> = 0.11, p = .74, SC-before: 69.56% (16/23), SC-after<sub>6-months</sub>: 73.91% (17/23)). Therefore, beyond short-term effects, the benefits of ms-tDCS neuromodulation in reducing procrastination show long-term retention.”

      (10) As was noted by an earlier reviewer, the paper reports nearly exclusively about the role of the left DLPFC, while there is also work that demonstrates the role of the right DLPFC in self-control. A more balanced presentation of the relevant scientific literature would be desirable.

      We are grateful to you for noticing the unbalanced presentation of the literature on left DLPFC. As you kindly suggested, we have added literature to support the association between self-control and the right lateralization of the DLPFC. Please see below for what we have revised:

      Introduction Section (Page 4, Line 137-143)

      “...In addition to the left lateralization, there is solid evidence indicating significant associations between self-control and the right DLPFC, particularly given that this region specifically functions in top-down regulation, future self-continuity representation and social decisions (Huang et al., 2025; Lin and Feng, 2024; Knoch & Fehr, 2007). Nevertheless, Xu and colleagues demonstrated null effects of anodally stimulating the right DLPFC to modulate either value evaluation or emotional regulation for changing procrastination willingness (Xu et al., 2023).”

      (11) Active stimulation reduced procrastination, reduced task aversiveness, and increased the outcome value. If I am not mistaken, the authors claim based on these results that the brain stimulation effect operates via self-control, but - unless I missed it - the authors do not have any direct evidence (such as measures or specific task measures) that actually capture self-control. Thus, that self-control is involved seems speculation, but there is no empirical evidence for this; or am I mistaken about this? If that is indeed correct, I think it needs to be made explicit that it is an untested assumption (which might be very plausible, but it is still in the current study not empirically tested) that self-control plays any role in the reported results.

      We truly appreciate your pointing out this weakness with regard to conceptualization. Yes, you are correct in understanding this causal chain: we conceptually speculate that the HD-tDCS stimulation over the left DLPFC operates via self-control to change procrastination, rather than empirically validating this component in the chain: brain stimulation→increased self-control→increased task outcome value→decreased procrastination. We did not collect data to directly measure self-control at either baseline or post-neuromodulation times. Therefore, we agree with your suggestion to state this explicitly in the main text. Following this advice, we have revised a portion of the Conclusion to clearly point out the hypothesis-generating role of self-control in mitigating procrastination, and have further stated this in the Limitation section:

      Abstract Section (Page 2, Line 55-57)

      “... This establishes a precise, value-driven neurocognitive pathway to account for the conceptualized roles of self-control on procrastination, and offers a validated, theory-driven strategy for interventions.”

      Results Section (Page 10, Line 489-492 and 520-522)

      “Given the dual neurocognitive pathways identified above—reduced task aversiveness and increased task-outcome value—we proposed that these changes, conceptually driven by enhanced self-control via ms-tDCS over left DLPFC, account for how neuromodulation reduces procrastination. ...”

      “In summary, these findings demonstrated a mechanistic pathway underlying procrastination: the self-control that was conceptualized to be governed by left DLPFC mitigates procrastination by plausibly increasing task-outcome value.”

      Discussion Section (Page 13, Line 642-645)

      “Moreover, this study did not collect data for assessing participants’ self-control at either baseline or post-neuromodulation, thereby limiting our ability to determine whether the effects on procrastination were uniquely attributable to neuromodulation-induced changes in self-control. ...”

      (12) Figures 3F and 3H show that procrastination rates in the active modulation group go to 0 in all participants by sessions 6 and 7. This seems surprising and, to be honest, rather unlikely that there is absolutely no individual variation in this group anymore. In any case, this is quite extraordinary and should be explicitly discussed, if this is indeed correct: What might be the reasons that this is such an extreme pattern? Just a random fluctuation? Are the results robust if these extreme cells are ignored? The authors remove other cells in their design due to unusual patterns, so perhaps the same should be done here, at least as a robustness check.

      Thank you for raising this highly important and helpful comment. Indeed, we fully understand that this result is somewhat extraordinary, a fact that was equally striking to us when unblinding the data. After carefully scrutinizing the data and statistics, we are thrilled to confirm that this pattern is true. In support of this observation, we were gratified to receive numerous thank-you letters from participants who engaged in active neuromodulation. They expressed gratitude to us, and reported that they have substantially ameliorated procrastination behavior in real-life activities after completing the trial. While this does not constitute formal scientific evidence, we are also glad to see the benefits of this neuromodulation for those procrastinators.

      Two reasons could account for this pattern herein. One interpretation is to attribute this pattern to “scalar inflation”. In the present study, the procrastination rate was calculated as 1 minus the task-completion rate (e.g., 80%, 60%, 40%) by the deadline. At sessions # 6 and #7, all the participants completed their real-life tasks before the deadline, yielding a 0% (1 minus 100% completion rate) procrastination rate, without any between-individual variation. Thus, rather than there being no individual variation in procrastination, this scalar – the procrastination rate - is too insensitive to capture subtle differences per se. For instance, although participants #1 and #2 both showed a 0% procrastination rate - meaning that both completed their tasks before the deadline - Participant #1 might have completed it 3 hours before the deadline, whereas Participant #2 might have completed it only 10 minutes before. In this case, the “scalar inflation” emerges to let us perceive that both participants have equivalent procrastination rates, although participant #2 may have a higher procrastination level than #1. As conceptually defined in the field, procrastination is contextualized as “not completing a task before the deadline”. Thus, if this task is completed before the deadline, regardless of whether it was finished close to or far in advance of the deadline, this case is defined as “no procrastination”. In the present study, the primary outcome is whether a participant procrastinated on a real-life task before the deadline in real-world settings, irrespective of when she/he completed this task. Thus, this scalar - procrastination rate - fits our conceptualization of procrastination.

      Another reason is the potential cumulative effect of sequential multi-session tDCS stimulation. As shown in the Mann-Kendall trend tests, the procrastination rates show a significant linear downtrend in the active neuromodulation group across sessions, even after removing sessions #6 and #7. This indicates that the improvement in procrastination may accumulate as the number of sessions increases, implying a potential “dose-dependent effect”. Although this interpretation is speculative, this “dose-dependent effect” in neuromodulation has been well documented in previous studies, which show a robust linear association between the number of sessions and effectiveness (cf. Cole et al., 2020; Hutton et al., 2023; Sabé et al., 2024; Schulze et al., 2018). Therefore, although this extreme pattern is somewhat extraordinary compared to previous observations, it is plausible.

      Yes, this is definitely a great idea to carry out a robustness check by removing sessions #6, #7, or both. We believe that this analysis supports statistical robustness against potential biases from extreme cells. By doing so, we found that all the group*treatment_day interaction effects remained significant when removing either session #6 or session #7 (or even both, all p-values < .05), indicating high statistical robustness. Please see Supplementary Tables S3 and S4.

      Taken together, in spite of their being extraordinary, we confirm that those findings are statistically robust to extreme outliers. As you kindly suggested, we have added those findings of the robustness check into the revised Supplemental Materials section.

      References

      Cole, E. J., Stimpson, K. H., Bentzley, B. S., Gulser, M., Cherian, K., Tischler, C., Nejad, R., Pankow, H., Choi, E., Aaron, H., Espil, F. M., Pannu, J., Xiao, X., Duvio, D., Solvason, H. B., Hawkins, J., Guerra, A., Jo, B., Raj, K. S., Phillips, A. L., … Williams, N. R. (2020). Stanford Accelerated Intelligent Neuromodulation Therapy for Treatment-Resistant Depression. The American journal of psychiatry, 177(8), 716–726. https://doi.org/10.1176/appi.ajp.2019.19070720

      Hutton, T. M., Aaronson, S. T., Carpenter, L. L., Pages, K., Krantz, D., Lucas, L., Chen, B., & Sackeim, H. A. (2023). Dosing transcranial magnetic stimulation in major depressive disorder: Relations between number of treatment sessions and effectiveness in a large patient registry. Brain stimulation, 16(5), 1510–1521. https://doi.org/10.1016/j.brs.2023.10.001

      Sabé, M., Hyde, J., Cramer, C., Eberhard, A., Crippa, A., Brunoni, A. R., Aleman, A., Kaiser, S., Baldwin, D. S., Garner, M., Sentissi, O., Fiedorowicz, J. G., Brandt, V., Cortese, S., & Solmi, M. (2024). Transcranial Magnetic Stimulation and Transcranial Direct Current Stimulation Across Mental Disorders: A Systematic Review and Dose-Response Meta-Analysis. JAMA network open, 7(5), e2412616. https://doi.org/10.1001/jamanetworkopen.2024.12616

      Schulze, L., Feffer, K., Lozano, C., Giacobbe, P., Daskalakis, Z. J., Blumberger, D. M., & Downar, J. (2018). Number of pulses or number of sessions? An open-label study of trajectories of improvement for once-vs. twice-daily dorsomedial prefrontal rTMS in major depression. Brain stimulation, 11(2), 327–336. https://doi.org/10.1016/j.brs.2017.11.002

      (13) The supplemental materials, unfortunately, do not give more information, which would be needed to understand the analyses the authors actually conducted. I had hoped I would find the missing information there, but it's not there.

      Sorry to offer uninformative supplemental materials (SM) in the original submission. As you suggested, we have added a substantial number of details to clarify how we conducted data analyses in the main text, and also tightened the whole SM section to improve readability and comprehensibility. We do hope that this revised manuscript could offer clear and adequate information in understanding methods and statistics for broader readers.

      In sum, the reported/cited/discussed literature gives the impression of being incomplete/selectively reported; the analyses are not reported sufficiently transparently/fully to evaluate whether they are appropriate and thus whether the results are trustworthy or not. At least some of the patterns in the results seem highly unlikely (0 procrastination in the verum group in the last 2 observation periods), and the sample size seems very small for a between-subjects design.

      Thank you for this very helpful summary. As you kindly suggested above, we have overhauled this manuscript to address the points that you listed here. In particular, we added relevant literature to balance our claims, added substantial detail to report statistics sufficiently and transparently, conducted a robustness check to confirm the statistical robustness of our findings to those plausible extreme patterns (sessions #6 and #7), and justified how we determined a priori a sample size that fulfills medium statistical power. Please see above for full details regarding how we addressed those comments, point-by-point.

      Reviewer #2 (Public Review):

      Chen and colleagues conducted a cross-sectional longitudinal study, administering high-definition transcranial direct stimulation targeting the left DLPFC to examine the effect of HD-tDCS on real-world procrastination behavior. They find that seven sessions of active neuromodulation to the left DLPFC elicited greater modulation of procrastination measures (e.g., task-execution willingness, procrastination rates, task aversiveness, outcome value) relative to sham. They report that tDCS effects on task-execution willingness and procrastination are mediated by task outcome value and claim that this neuromodulatory intervention reduces procrastination rates quantified by their task. Although the study addresses an interesting question regarding the role of DLPFC on procrastination, concerns about the validity of the procrastination moderate enthusiasm for the study and limit the interpretability of the mechanism underlying the reported findings.

      Strengths:

      (1) This is a well-designed protocol with rigorous administration of high-definition transcranial direct current stimulation across multiple sessions. The approach is solid and aims to address an important question regarding the putative role of DLPFC in modulating chronic procrastination behavior.

      (2) The quantification of task aversiveness through AUC metrics is a clever approach to account for the temporal dynamics of task aversiveness, which is notoriously difficult to quantify.

      Thank you for taking your invaluable time to review our manuscript, warmly applauding the strength in research design and the conceptualization of scaling task aversiveness, as well as kindly sharing such helpful and insightful evaluations. As you correctly pointed out, we are aware of the absence of detailed, clear and understandable reporting of measures (e.g., real-world procrastination), statistics and methods, in the original manuscript. Following all your suggestions, we have thoroughly revised this manuscript to address those comments that you kindly made, point-by-point. Please see the full response underneath.

      Weaknesses:

      (1) The lack of specificity surrounding the "real-world measures" of procrastination is problematic and undermines the strength of the evidence surrounding the DLPFC effects on procrastination behavior. It would be helpful to detail what "real-world tasks" individuals reported, which would inform the efficacy of the intervention on procrastination performance across the diversity of tasks. It is also unclear when and how tasks were reported using the ESM procedure. Providing greater detail of these measures overall would enhance the paper's impact.

      We genuinely appreciate your raising this very crucial comment. We are sorry for omitting a tremendous number of methodological details to comply with the editorial requirement on the manuscript’s length, which hampered the comprehension of how we measure “real-life tasks” and “real-world procrastination”.

      As shown in the schematic diagram for experimental procedure (Fig. 1), the experimental protocol alternated between Neuromodulation Days (Days 2, 4, 6, 8, 10, 12, 14) and Task Days (Days 1, 3, 5, 7, 9, 11, 13, 15). On each Neuromodulation Day, participants received either active or sham HD-tDCS, and—critically—before stimulation—were instructed to specify a real-life task they were required to complete the following day, with a deadline between 18:00 and 24:00. This ensured ≥24 hours between neuromodulation and task execution, isolating offline after-effects. For instance, on Day #2 (Neuromodulation Day), before carrying out stimulation, participants were asked to report a real-life task that has a deadline within 18:00 - 24:00 for tomorrow’s “task day” (Day #3) (please see the schematic diagram in Author response image 2).

      Author response image 2.

      There are some real-life tasks that they reported in our experiment as examples: “Complete and submit a homework assignment”, “Complete a standardized English proficiency test”, “Complete an online course module required for applying a Class C driver’s license”, “Prepare slides for a seminar presentation”, “Practice guitar”, “Practice Chinese calligraphy”, and “Do the laundry”. Reported tasks spanned academic (e.g., submitting an assignment), occupational (e.g., preparing a presentation), administrative (e.g., applying for a license), self-improvement (e.g., practicing guitar for ≥30 min), domestic (e.g., laundry), and health-related domains (e.g., running ≥ 2,000m for exercise), indicating a plausible task diversity.

      On each “task day”, participants engaged in an intensive Experience Sampling Method (iESM) protocol via a custom-built mobile app. Using this app, participants were required to report a subjective task-execution willingness score (i.e., a one-item 100-point visual analog scale, “How willing are you to do this task?”, 0 for “I will definitely procrastinate this task” and 100 for “I will take action to complete this task immediately”; procrastination willingness = 100 – the task-execution willingness score), the subjective task aversiveness (i.e., a one-item 100-point visual analog scale), the subjective task outcome value (i.e., a one-item 100-point visual analog scale), and the objective procrastination rate, respectively.

      Rather than self-reported scores from those one-item visual analog scales, we asked participants to report real “task completion rate” for the objective quantification of the “real-world procrastination behavior”. Specifically, at the deadline, each participant was asked to report whether she/he had completed this task. If she/he reported not having yet completed the task (i.e. procrastination behavior emerged), she/he was further required to report the percentage of the task completed (1% - 99%), which was defined as the task completion rate. By doing so, we could calculate the real-world procrastination rate for the real-life task as the “1 – the task completion rate”. For instance, if a participant did not complete her/his real-life task before the deadline (i.e. she/he procrastinated this task) and reported completing 75% of this task at the deadline, her/his real-world procrastination rate was computed as the 25% (1 - 75%) (Please see the schematic diagram in Author response image 3).
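      The arithmetic described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the code of the study's mobile app; the function names are hypothetical, and the only assumptions are the ones stated in the text (procrastination willingness = 100 minus the task-execution willingness score; procrastination rate = 100% minus the completion rate, with a completed task counted as 100% complete).

```python
from typing import Optional


def procrastination_willingness(tew_score: float) -> float:
    """Procrastination willingness = 100 - task-execution willingness (0-100 scale)."""
    if not 0 <= tew_score <= 100:
        raise ValueError("task-execution willingness must lie in [0, 100]")
    return 100.0 - tew_score


def procrastination_rate(completed: bool,
                         percent_completed: Optional[float] = None) -> float:
    """Procrastination rate (PR, in percent) at the deadline.

    A task completed by the deadline has a completion rate of 100%, so PR = 0%.
    Otherwise the participant reports the percentage completed (1-99%),
    and PR = 100% - that percentage.
    """
    if completed:
        return 0.0
    if percent_completed is None or not 1 <= percent_completed <= 99:
        raise ValueError("an incomplete task needs a completion rate in [1, 99]")
    return 100.0 - percent_completed


# Example from the text: 75% completed at the deadline gives a PR of 25%.
print(procrastination_rate(False, 75))
```

      Because any task finished before the deadline maps to a rate of 0%, this measure cannot distinguish early from last-minute completion, which is the insensitivity discussed earlier in this response.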

      Moreover, rather than merely a self-reported task completion rate, each participant was also asked to upload proof (e.g., screenshots of submitted assignments, photos of printed documents, system timestamps) to the ESM digital system for validation.

      Author response image 3.

      To determine the sampling time points for this mobile app in the ESM, we capitalized on both the conceptual temporal decision model and the statistical Myerson algorithm. Specifically, the Temporal Decision Model (TDM) was originally proposed by our team (Xu et al., 2023; Zhang et al., 2019, 2020, 2021), which theoretically conceptualizes procrastination as the failure of the trade-off between task outcome value (i.e., motivation to take actions now for pursuing task reward) and task aversiveness (i.e., motivations for avoiding taking action now for avoiding negative experiences). Once task aversiveness overrides the pursuits of task outcome values, procrastination emerges. One overarching hypothesis in this theoretical model is that the task aversiveness is hyperbolically discounted when approaching the deadline: it would be discounted sharply when far from the deadline but discounted slowly when nearing the deadline (Zhang et al., 2019). To maximize statistical power to fit dynamic motivational curves, we employed a log-spaced temporal sampling scheme (Myerson et al., 2001) (please see the schematic diagram in https://uen.pressbooks.pub/behavioraleconomics/chapter/the-reality-of-homo-sapiens, where each point indicates a sampling time):

      By this fitting algorithm (Myerson et al., 2001), five time points were selected to fulfill the statistical prerequisites for hyperbolic model fitting, with increasing sampling density toward the deadline (e.g., for a task due at 20:00: sampled at 10:00, 16:00, 18:00, 19:30, 20:00). Once the task-specific five sampling time points were determined per participant, this mobile app sent a digital message to ask her/him to immediately report the task aversiveness and the task outcome value then. As the primary outcomes, the procrastination rate (i.e., 1 – the task completion rate) and the procrastination willingness were sampled at the deadline point.
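      To make the idea of a sampling schedule that becomes denser toward the deadline concrete, a minimal Python sketch follows. It is an assumption-laden illustration rather than the procedure actually used in the study: the function name, the geometric spacing of lead times, and the default 10-hour and 30-minute leads are all hypothetical choices, and the resulting times only approximate the example schedule given above.

```python
from datetime import datetime, timedelta


def log_spaced_sampling_times(deadline: datetime,
                              earliest_lead_h: float = 10.0,
                              shortest_lead_h: float = 0.5,
                              n_points: int = 5) -> list:
    """Return n_points sampling times, denser toward the deadline.

    The first n_points - 1 samples are placed at lead times (hours before
    the deadline) spaced geometrically, i.e. evenly on a log scale, from
    earliest_lead_h down to shortest_lead_h. The last sample is taken at
    the deadline itself.
    """
    if n_points < 3:
        raise ValueError("need at least 3 sampling points")
    if not earliest_lead_h > shortest_lead_h > 0:
        raise ValueError("require earliest_lead_h > shortest_lead_h > 0")

    n_leads = n_points - 1
    # Constant ratio between consecutive lead times (log-spaced leads).
    ratio = (shortest_lead_h / earliest_lead_h) ** (1.0 / (n_leads - 1))
    leads = [earliest_lead_h * ratio ** k for k in range(n_leads)]

    times = [deadline - timedelta(hours=lead) for lead in leads]
    times.append(deadline)
    return times


# Hypothetical task due at 20:00; the date itself is arbitrary.
for t in log_spaced_sampling_times(datetime(2024, 1, 1, 20, 0)):
    print(t.strftime("%H:%M"))
```

      With these defaults the first sample falls at 10:00 and the last at the 20:00 deadline, and the gaps between consecutive samples shrink as the deadline approaches.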

      Furthermore, yes, we fully concur with you on this great idea, that is, transparency about task diversity strengthens the generalizability of our findings. In response, we have tabulated these real-life tasks that were reported in this experiment in the independent Appendix 1, with automatic translations from Chinese to English via Qwen GPT. Please see below for what we have added to the main text:

      Methods Section (Page 6-7, Line 238-308)

      “Nested cross-sectional longitudinal design

      This study used a nested cross-sectional longitudinal design to investigate whether the multiple-session anodal HD-tDCS targeting the left DLPFC could reduce actual procrastination behavior and to probe how this effect manifests. To assess procrastination in daily life, we implemented a 15-day protocol alternating between Neuromodulation Days (Days 2, 4, 6, 8, 10, 12, 14) and Task Days (Days 1, 3, 5, 7, 9, 11, 13, 15). On the Neuromodulation days, the 20-min anodal HD-tDCS neuromodulation targeting the left DLPFC was performed for the HD-tDCS active group at intervals of 2 days, while the sham-control group received sham HD-tDCS training. This HD-tDCS training was repeated for a total of seven sessions, and lasted 15 days (see Fig. 1a). Crucially, to capture procrastination in ecologically valid contexts, prior to receiving either active or sham HD-tDCS (administered between 09:00–18:00), participants were instructed to specify a real-life task they were personally obligated to complete the following day, with a self-defined deadline strictly constrained to 18:00–24:00 to ensure ≥24 hours between stimulation offset and task deadline, thereby isolating offline after-effects. This task should meet the following three criteria: (a) it should be already assigned in real-world settings; (b) its deadline should be constrained to 18:00-24:00 (see above); (c) it should be likely to induce procrastination. By doing so, more than 300 real-life tasks were collected, spanning academic (e.g., “submit a statistics homework assignment”), occupational (e.g., “draft and email a project proposal”), administrative (e.g., “complete online application for Class C driver’s license”), self-improvement (e.g., “practice guitar for ≥30 minutes”), domestic (e.g., “do laundry”), and health-related (e.g., “running 2,000m for exercise”) domains. The full task list has been tabulated in Appendix 1.
      As primary outcomes, all the participants were required to report task-execution willingness (TEW) (Zhang & Feng, 2020; Zhang, Liu, et al., 2019) for a real-life task 24 hours post-neuromodulation. Thus, procrastination willingness was quantified as 100-TEW score (see underneath for details). Furthermore, we asked participants to report the actual task completion rate (CR) of the task at the deadline (e.g., participant A had finished 90% of the homework at the deadline and reported this to us at that time). In this vein, the actual procrastination rate (PR) was quantified as 1-CR.

      On the Task day, we developed a mobile app to implement the experience sampling method (ESM) for tracking one’s real-time evaluation of task aversiveness and task outcome value (see Fig. 1). Task aversiveness describes how disagreeable one perceives performing a given real-life task to be, whereas outcome value refers to the subjective benefits of the task outcome brought about by completing the task before the deadline (Zhang & Feng, 2020). As theoretically conceptualized by the temporal decision model (TDM) of procrastination, the perceived task aversiveness is hyperbolically discounted when approaching the deadline, showing sharp discounting when far away from the deadline but slow discounting once nearing the deadline (Zhang & Feng, 2020; Zhang et al., 2021). Thus, considering the nonlinear dynamics inherent in this hyperbolic discounting, the five recording moments of ESM were selected per task a priori by using a log-spaced temporal sampling scheme (Myerson et al., 2001), with increasing sampling density toward the deadline, such as moments of 10:00 (earliest), 16:00, 18:00, 19:30, 20:00 (deadline). The five sampling points could meet the statistical prerequisite for hyperbolic model fitting (requiring ≥ 4 points; Green & Myerson, 2004). To do so, recording moments were individually tailored for each task per participant in this ESM procedure. To obviate the confounds of daily emotions in task aversiveness evaluation, we used the averaged scores of PANAS at 10:00 (noon) and 16:00 (afternoon) as anchoring points to quantify one’s daily emotions by using this ESM app. Before each session of HD-tDCS training, each participant was required to report a real-life task whose deadline is tomorrow. To obtain the long-term effect of HD-tDCS (i.e., the interval between HD-tDCS and task completion is at least 24 hours), the task deadline that participants reported was required to be between 18:00 - 24:00.
      Once a sampling time was reached, this app would send a digital message requiring participants to fill in an online form for data collection.

      Quantification of covariates of interests

      Outcome variables of this study were twofold: one is task-execution willingness and another is procrastination rate (PR). Task-execution willingness is used to evaluate one’s subjective inclination to avoid procrastination (Zhang & Feng, 2020). In this vein, we used a 100-point scale to require participants to report their task-execution willingness (0 for “I will definitely procrastinate this task” and 100 for “I will take action to complete this task immediately”). This metric was recorded 24 hours after neuromodulation to examine its long-term effects. PR is used to quantify the extent to which one task has been procrastinated, and was calculated as 1 - CR (task completion rate). Critically, at the precise deadline, the app prompted participants to (a) indicate task completion status (yes/no), and if incomplete, (b) report the percentage completed (1–99%), defined as the Task CR, while simultaneously uploading objective evidence (e.g., screenshots of submitted files, photos of physical outputs, system-generated logs, or app-exported records). If the task was actually completed before the deadline, the CR would be 100% and the PR would be calculated as 0% (1-CR). PR was recorded at the actual task deadline for each participant. We were also interested in re-investigating their actual procrastination by using PR 6 months after the last neuromodulation to test the long-term retention of this neuromodulation effect.”

      References

      Myerson, J., Green, L., & Warusawitharana, M. (2001). Area under the curve as a measure of discounting. Journal of the experimental analysis of behavior, 76(2), 235–243. https://doi.org/10.1901/jeab.2001.76-235

      Xu, T., Zhang, S., Zhou, F., & Feng, T. (2023). Stimulation of left dorsolateral prefrontal cortex enhances willingness for task completion by amplifying task outcome value. Journal of experimental psychology. General, 152(4), 1122–1133. https://doi.org/10.1037/xge0001312

      Zhang, S., Verguts, T., Zhang, C., Feng, P., Chen, Q., & Feng, T. (2021). Outcome Value and Task Aversiveness Impact Task Procrastination through Separate Neural Pathways. Cerebral cortex (New York, N.Y. : 1991), 31(8), 3846–3855. https://doi.org/10.1093/cercor/bhab053

      Zhang, S., Liu, P., & Feng, T. (2019). To do it now or later: The cognitive mechanisms and neural substrates underlying procrastination. Wiley interdisciplinary reviews. Cognitive science, 10(4), e1492. https://doi.org/10.1002/wcs.1492

      Zhang, S., & Feng, T. (2020). Modeling procrastination: Asymmetric decisions to act between the present and the future. Journal of experimental psychology. General, 149(2), 311–322. https://doi.org/10.1037/xge0000643

      (2) Additionally, it is unclear whether the reported effects could be due to differential reporting of tasks (e.g., it could be that participants learned across sessions to report more achievable or less aversive task goals, rather than stimulation of DLPFC reducing procrastination per se). It would be helpful to demonstrate whether these self-reported tasks are consistent across sessions and similar in difficulty within each participant, which would strengthen the claims regarding the intervention.

      Thank you for raising this very crucial comment. We agree with you that the reported effects may vary with task difficulty and task-execution proficiency, which potentially confound the effects of stimulation on mitigating procrastination. On the one hand, as you correctly comment, because we did not collect data on the difficulty or other relevant characteristics of tasks, we cannot completely rule out this confounder in interpreting our findings. As a result, we have explicitly stated this limitation in the Discussion section.

      On the other hand, despite no quantitative evidence, this risk of confounding main effects with disparities in task characteristics was controlled experimentally. As we reported above, all the reported tasks were mandated to meet three criteria: (a) they were already assigned in real-world settings; (b) the deadline was constrained to 18:00-24:00; (c) they were likely to lead to procrastination. To do so, each participant was clearly instructed to report a real-life task that was likely to be procrastinated in real-world settings, and was not allowed to report easy, achievable and cost-less tasks. Supporting this case, the reported tasks spanned academic (e.g., submitting an assignment), occupational (e.g., preparing a presentation), administrative (e.g., applying for a license), self-improvement (e.g., practicing guitar for ≥30 min), domestic (e.g., laundry), and health-related domains (e.g., running ≥ 2,000m for exercise), indicating a plausible task diversity and difficulty. This was corroborated by the high within-subject task homogeneity that we observed. For instance, Participant #5 reported tasks that were almost all academic activities across all the sessions. Therefore, as the task list shows (please see Appendix 1), these self-reported tasks were plausibly consistent across sessions and similar in difficulty within each participant.

      In addition, as we tested, almost all the participants reported that they were receiving treatment: 91.30% (21/23) in the active neuromodulation group (NM) and 86.95% (20/23) in the sham control group (SC) (χ<sup>2</sup> = 0.224, p = .636), indicating the effectiveness of the double-blinding methods. If participants had learned across sessions to report more achievable or less aversive task goals, their procrastination willingness and procrastination rates for the reported tasks would have decreased steadily, irrespective of whether they were in the active neuromodulation group or the sham group. However, no such decreasing trend across sessions existed in the sham control group (Mann-Kendall test, for procrastination willingness, tau = 0.60, p = .13; for procrastination rate, tau = 0.61, p = .13), indicating no statistically significant learning effect or strategic effect on task performance. Again, thank you for this very crucial comment, and we do hope these clarifications could address it.
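
      The blinding comparison above can be checked from the reported counts alone. The following is a minimal sketch, assuming an uncorrected Pearson chi-square test on the 2×2 table of "believed they received treatment" versus "did not" (the response does not state whether a continuity correction was applied); the function and variable names are illustrative and not taken from the authors' analysis code.

```python
import math


def pearson_chi2_2x2(a, b, c, d):
    """Uncorrected Pearson chi-square for the 2x2 table [[a, b], [c, d]].

    Returns the statistic and its p-value for 1 degree of freedom.
    """
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For df = 1 the chi-square survival function equals erfc(sqrt(chi2 / 2)).
    p = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p


# Active group: 21 of 23 reported receiving treatment; sham group: 20 of 23.
chi2, p_value = pearson_chi2_2x2(21, 2, 20, 3)
print(f"chi2 = {chi2:.3f}, p = {p_value:.3f}")  # chi2 = 0.224, p = 0.636
```

      Under this assumption the sketch reproduces the reported χ<sup>2</sup> = 0.224 and p = .636, which suggests that no continuity correction was used for this comparison.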

      Limitations Section (Page 12, Line 637-640)

      “In addition, despite instructing to report valid real-life tasks with high probabilities to procrastinate, we had not yet measured the task difficulty and consistency across sessions for each participant. Consequently, interpreting the effects of neuromodulation to mitigate procrastination as “unique contributions” should warrant cautions. ...”

      (3) It would be helpful to show evidence that the procrastination measures are valid and consistent, and detail how each of these measures was quantified and differed across sessions and by intervention. For instance, while the AUC metric is an innovative way to quantify the temporal dynamics of task-aversiveness, it was unclear how the timepoints were collected relative to the task deadline. It would be helpful to include greater detail on how these self-reported tasks and deadlines were determined and collected, which would clarify how these procrastination measures were quantified and varied across time.

      We do appreciate your highlighting the importance of clarifying how to measure procrastination, substantially helping readers to interpret these findings. As reported above, the primary outcomes of this experiment included subjective procrastination willingness and objective actual procrastination rate. For the subjective procrastination willingness, using the purpose-built mobile app, participants were required to report subjective task-execution willingness score (i.e., one-item 100-point visual analog scale, “How willing are you to do this task?”, 0 for “I will definitely procrastinate this task” and 100 for “I will take action to complete this task immediately”). Thus, the procrastination willingness was computed as “100 – the task-execution willingness score”. For the objective procrastination rate, rather than self-reported scores from those one-item visual analog scales, we asked participants to report the real “task completion rate from 1% to 99%” for the objective quantification of the “real-world procrastination behavior”. Full details can be found in Response #1.

      For determining sampling time points for the quantification of AUC, we capitalized on both the conceptual Temporal Decision Model and the statistical Myerson algorithm. Specifically, the Temporal Decision Model (TDM) was originally proposed by our team (Xu et al., 2023; Zhang et al., 2019, 2020, 2021), which theoretically conceptualizes procrastination as the failure of the trade-off between task outcome value (i.e., motivation to take actions now for pursuing task reward) and task aversiveness (i.e., motivations for avoiding taking action now for avoiding negative experiences). Once task aversiveness overrides the pursuits of task outcome values, the procrastination emerges. One overarching hypothesis in this theoretical model is that the task aversiveness is hyperbolically discounted when approaching the deadline: it would be discounted sharply when being far from the deadline but discounted slowly when nearing the deadline (Zhang et al., 2019). To maximize statistical power to fit dynamic motivational curves, we employed a log-spaced temporal sampling scheme (Myerson et al., 2001). By this fitting algorithm (Myerson et al., 2001), five time points were selected to fulfill the statistical prerequisites for hyperbolic model fitting, with increasing sampling density toward the deadline (e.g., for a task due at 20:00: sampled at 10:00, 16:00, 18:00, 19:30, 20:00).

      Once the five task-specific sampling time points were determined for each participant, the mobile app sent a digital message asking her/him to immediately report the task aversiveness and the task outcome value at that moment. After capturing the task aversiveness at those five time points, the task aversiveness discounting was calculated as 1 - (A(t) / A(earliest)), where A(t) is the aversiveness reported at sampling time t and A(earliest) is the aversiveness at the earliest sampling point (e.g., 10:00), serving as the reference for immediate execution. Subsequently, using the GraphPad Prism software (v9, 525), we estimated the AUC from those five data points based on the Myerson algorithm (Myerson et al., 2001), which was computed via trapezoidal integration of task aversiveness discounting over time. By this modelling method, a higher AUC reflects stronger temporal discounting of task aversiveness, which means that participants experience a faster decline in subjective aversiveness as execution is delayed, yielding lower effective aversiveness and reduced avoidance behavior. That is to say, if a participant shows greater discounting of task aversiveness, as reflected by a higher AUC, she/he experiences a more pronounced reduction in subjective aversiveness upon postponement, plausibly yielding less procrastination.
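
      The AUC computation described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' pipeline (which used GraphPad Prism): the aversiveness ratings are hypothetical, the sampling times follow the 20:00-deadline example given earlier, and the time axis is rescaled to [0, 1] as is common for Myerson-style AUC, which the response does not explicitly confirm.

```python
def aversiveness_discounting(ratings):
    """Discounting at each time point: 1 - A(t) / A(earliest)."""
    earliest = ratings[0]
    return [1 - a / earliest for a in ratings]


def trapezoid_auc(times, values):
    """Trapezoidal area under values over times, with times rescaled to [0, 1]."""
    span = times[-1] - times[0]
    xs = [(t - times[0]) / span for t in times]  # assumed normalization
    return sum(
        (xs[i + 1] - xs[i]) * (values[i] + values[i + 1]) / 2
        for i in range(len(xs) - 1)
    )


# Sampling times in hours for a task due at 20:00 (10:00, 16:00, 18:00, 19:30, 20:00).
times = [10.0, 16.0, 18.0, 19.5, 20.0]
# Hypothetical aversiveness ratings on a 0-100 scale, not data from the study.
ratings = [80.0, 60.0, 50.0, 44.0, 40.0]

discount = aversiveness_discounting(ratings)
auc = trapezoid_auc(times, discount)
print(f"AUC = {auc:.4f}")
```

      With this definition, a participant whose aversiveness falls more steeply from the earliest rating yields a larger AUC, matching the interpretation given above.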

      Taken together, following your suggestion, we have added a substantial number of details to clarify how to measure procrastination, when to sample the data and how to estimate the AUC into the revised manuscript. Please see them in Response #1.

      (4) There are strong claims about the multi-session neuromodulation alleviating chronic procrastination, which should be moderated, given the concerns regarding how procrastination was quantified. It would also be helpful to clarify whether DLPFC stimulation modulates subjective measures of procrastination, or alternatively, whether these effects could be driven by improved working memory or attention to the reported tasks. In general, more work is needed to clarify whether the targeted mechanisms are specific to procrastination and/or to rule out alternative explanations.

      Yes, we fully agree with you on this consideration: we should tone down the conclusions currently claimed in the main text, given the inherent shortcomings mentioned above. As you helpfully suggested, we have moderated our overall claims regarding the effects of multi-session neuromodulation in alleviating chronic procrastination. Please see specific instances below:

      Abstract Section (Page 2, Line 55-57)

      “... This establishes a precise, value-driven neurocognitive pathway to account the conceptualized roles of self-control on procrastination, and potentially offers a validated, theory-driven strategy for interventions.”

      Conclusion Section (Page 13, Line 657-664)

      “In conclusion, this study potentially provides an effective way to reduce both procrastination willingness and actual procrastination behavior by using neuromodulation on the left DLPFC. Furthermore, such effects have been observed for 2-day-interval long-term after-effects, and were also found for 6-month long-term retention in part. More importantly, this study identified that the ms-tDCS neuromodulation could decrease task aversiveness and increase task outcome value while, and further demonstrated that the increased task outcome value could predict decreased procrastination, a relationship conceptually driven by enhancing self-control. In this vein, the current study enriches our understanding of neurocognitive mechanism of procrastination by showing the prominent role of increased task outcome value in reducing procrastination. Also, it may provide an effective method for intervening in human procrastination.”

      Moreover, yes, as we clarified above, in addition to the objective measure of procrastination behavior, we also leveraged a one-item visual analog scale (i.e. one-item 100-point visual analog scale, “How willing are you to do this task?”, 0 for “I will definitely procrastinate this task” and 100 for “I will take action to complete this task immediately”) to measure subjective procrastination willingness. Results demonstrated that the subjective procrastination willingness significantly decreased across neuromodulation sessions in the active group, but not in the sham control group, consistent with the observed reduction in the objective procrastination measure. In addition, we all perceive it as helpful and crucial to note that we cannot draw the conclusion that the effects of neuromodulation on mitigating procrastination are contributed by increasing task outcome value uniquely. Given no measures or evidence of other factors, such as working memory and attention, we cannot rule out other neurocognitive pathways. To address this point, we have removed or rephrased such statements throughout the whole revised manuscript, and explicitly constrained to interpret this neurocognitive mechanism (i.e., increased task outcome value) within the theory-driven framework of the temporal decision model.

      Reviewer #3 (Public review):

      This manuscript explores whether high-definition transcranial direct current stimulation (HD-tDCS) of the left DLPFC can reduce real-world procrastination, as predicted by the Temporal Decision Model (TDM). The research question is interesting, and the topic - neuromodulation of self-regulatory behavior - is timely.

      Many thanks for kindly dedicating time to review our manuscript, and for the helpful comments detailed below. Thank you for appreciating the novelty of this study.

      However, the study also suffers from a limited sample size, and sometimes it was difficult to follow the statistics.

      Thank you for pointing out these crucial concerns. As you correctly raised, the sample size is somewhat small, but we confirm that it is adequate to detect a medium effect size with sufficient statistical power.

      For estimating the sample size, we determined the a priori effect size based on the existing work we published (Xu et al., 2023, J Exp Psychol Gen;152(4):1122-1133). In this pilot study, we identified a significant interaction effect between single-session tDCS stimulation (active vs sham) and time (pre-test vs post-test) (t = 2.38, p = .02, n = 27; 95% CI [0.14, 1.49]) for changing procrastination willingness in laboratory settings, indicating a medium effect size. Therefore, this pilot study provides supportive evidence to determine this effect size a priori.

      Using the G*Power software with an estimated medium effect size, we determined that a total sample size of N<sub>total</sub> = 34 could reach adequate statistical power. Please see the G*Power outputs in Author response image 1.

      As for the statistics, we genuinely acknowledge that the vague methodological descriptions and complex algorithms indeed complicated the understanding of the methods and statistics. To address this, echoing the comment raised by Reviewer #1, we have removed the complicated statistics and methods, and further clarified how we used the generalized linear mixed-effect model (GLMM) for statistical analysis. Please see the specific revisions below:

      Methods Section (Page 8, Line 378-403)

      “Statistics

      All the statistics were implemented by R (https://www.rstudio.com/) and R-dependent packages.

      To clarify whether multiple-session HD-tDCS neuromodulation can reduce procrastination, the generalized mixed-effects linear model (GLMM) was constructed with full factorial design for subjective procrastination willingness (i.e., self-reported visual analog scores) and actual procrastination behavior (i.e., real-world task-completion rate before deadline). Here, sex, age and socioeconomic status (SES) were modeled as covariates of no interest. As the National Bureau of Statistics (China) issued (https://www.stats.gov.cn/sj/tjbz/gjtjbz/), on the basis of per capita annual household income, the SES was divided into seven hierarchical tiers from 1 (poor) to 7 (rich). To obviate subjective rating bias stemming from individual daily mood, we separately measured participants’ daily emotional fluctuation at 10:00 and 16:00 using a self-rating visual analog item (i.e., “How do feel for your mood today?”, 0 for “completely uncomfortable” and 100 for “definitely happy”). By doing so, the averaged score of those self-rating emotions at the two time points was modeled into the GLMM as covariate of no interests, yielding the final expression of “outcome ~ Group*Treatment_Day + Age + Gender + SES + Emotions + (1 + Treatment_Day | SubjectID)” in the statistical model. This analysis was implemented using the “lme4” and “lmerTest” packages. Employing “emmeans” package, simple effects were also tested at baseline and post-last-intervention using Tukey-adjusted pairwise comparisons of estimated marginal means from the full GLMM, controlling for covariates and random-effects structure. To validate statistical robustness, instead of continuous outcomes for parametric tests, we also conducted a between-group comparison for the number of tasks that procrastination emerges by using the nonparametric χ<sup>2</sup> test with φ correction or Fisher exact test. 
Regarding the 6-month follow-up investigation, this GLMM was also built to examine the long-term retention of neuromodulation on reducing actual procrastination.”

      The preregistration and ecological design (ESM) are commendable, but I was not able the find the preregistration, as reported in the paper.

      We are sorry to encounter a serious technical barrier that has rendered our preregistration invisible and inaccessible. The OSF has disabled my OSF account, as it claimed to detect “suspicious user’s activities” in my account. This has prevented access to all materials deposited in this OSF account, including this preregistration. We have contacted the OSF team, but received no valid technical solution to recover this preregistered report (please see the screenshot below). We reckon that this may be due to my affiliation change to the Third Military Medical University of People’s Liberation Army (PLA).

      To address this unexpected circumstance and to ensure transparency, we have explicitly reported this case in the main text, and added the “Reconstructed Preregistration Statement” to the Supplemental Materials (SM). Also, because a reconstructed preregistration no longer meets best practices, in addition to transparently reporting this case, we have removed the preregistration statements elsewhere throughout the revised manuscript.

      Overall, the paper requires substantial clarification and tightening.

      We are grateful for your evaluation, and we fully agree with you. In response, we have added a tremendous number of details to clarify how to measure procrastination, how to conduct the statistical analyses, and how to collect real-life tasks, as well as other experimental materials. Please see the revisions in the Methods section of the revised manuscript. Again, thank you for those helpful suggestions.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      (1) In the Supplemental Materials, page 4, lines 163 to 167 seem to be from a different manuscript (as the section talks about neural markers, significant clusters, and brain networks).

      We are sorry for erroneously embedding this irrelevant section here. We have removed it, and have double-checked the document to avoid such mistakes.

      (2) I'm no expert here, but some of the trace and density plots in the SOM look problematic (e.g., Figure S5 top panel). But it's not made clear to which model/analysis these plots belong, so they are not very helpful without that information.

      Thank you for bringing these potentially problematic plots to our attention. Following your great suggestion, these results have been removed from the SM to amplify readability and comprehensibility.

      (3) Table S1 reports side effects "from the neurostimulation" (this is also the language used in the main manuscript), but having the flu is rather unlikely to be a side effect from the stimulation, isn't it? Thus, this language is highly confusing, and when reading the main text, it's not clear that these are just life events that are most likely unrelated to the stimulation, but have the potential to affect the measured variables (i.e., ultimately, they seem a source of noise).

      We apologize for this confusing wording. Here, the “side effects” are defined as confounding effects deriving from unexpected life events that uncontrollably disrupt task execution and task performance, such as “having the flu”, or “an unexpected mandatory CCP (Communist Party of China) meeting assignment”. To obviate misunderstanding, we have rephrased “side effects” as “unexpected life events disrupting task execution” in both the main text and the SM section.

      (4) The use of the English language could be improved.

      Thank you for your very practical suggestion. As you kindly suggested, we have invited a proofreading editor to edit and polish the English of the revised manuscript.

      Reviewer #2 (Recommendations for the authors):

      (1) It would be helpful to include greater detail about the ESM procedure and details of the self-reported tasks. This would help rule out potential confounds of difficulty or learning (e.g., participants may have learned to identify more achievable and less difficult tasks across the sessions, which would mean they are learning to perform the task better rather than to procrastinate less). Further elaboration on the quantification of procrastination measures would help clarify the mechanism underlying this behavior, which is important for clarifying how these effects arise and what aspect of procrastination behavior is being targeted by the tDCS intervention (and rule out alternative explanations).

      We wholeheartedly appreciate your sharing this very crucial recommendation. As we mentioned above, we fully followed your helpful suggestions, particularly by adding massive details to fully report how to collect real-life tasks (with consistent and plausible difficulty across sessions), how to determine sampling time points, and how to quantify metrics (e.g., subjective procrastination willingness score, objective procrastination rate, AUC of task aversiveness, and task outcome value) to the revised manuscript. We do believe that these revisions and clarifications are imperative and necessary. By including these details, we do believe that the readability and clarity have been substantially improved in the current form. Please see the specific revisions and clarifications above.

      (2) It would be helpful to proofread for grammatical and spelling typos (e.g., DLPFC is spelled incorrectly in line 140, Satterwaite is spelled incorrectly in Line 415).

      Thank you for your kind suggestion. Both spelling typos have been corrected, and we have double-checked the revised manuscript to ensure no such typos remain. As you kindly suggested, we have invited a proofreading editor to edit and polish the English of the revised manuscript.

      (3) Please clarify in Figure 4 that a higher AUC is associated with lower task aversiveness (which is stated in the methods but not clearly in the figure).

      Many thanks to you for your helpful suggestion. As you kindly suggested, we have clarified this case in the figure legend.

      Reviewer #3 (Recommendations for the authors):

      I want to see the preregistration.

      Thank you for your helpful recommendation. As we replied above, a serious technical issue on OSF occurred, making our preregistration invisible and inaccessible. OSF has disabled my account, claiming to detect “suspicious user’s activities” in my account. As a result, there is no access to all materials that were already deposited in this OSF account, including this preregistration. We have reconstructed this preregistration based on archived documents, and reported it in the SM. As we reported above, although this partially addresses the problem, it no longer fulfills the best practices of preregistration. Consequently, in addition to transparently reporting this case, we have removed all the preregistration statements throughout the revised manuscript.

    1. Author response:

      Both reviewers noted that some published studies question the association of HPV types with cervical cancer survival {PMIDs 36207323 and 33117670}, while others did not observe that {REFS 69-74 in Chakravarty}. We appreciate both reviewers pushing us to discuss and hypothesize (even speculate) on our finding that HPV types not in phylogenetic clade α9 types (including HPV18) had more recurrences than α9 types (including HPV16). The most likely explanation is that we analyzed 225 HPV types not just the most prevalent types. Specifically, each of the 5 recurrences in our pilot study had different HPV types (α7’s: 18, 39, 45, 70 & α5: 69). Similarly, on re-examination of the TCGA data set, we found that 80% of the 181 α9 samples had HPV16, while 52.5% of the non-α9 samples had HPV18, consistent with a broader variety of types in the latter. We note that PMID: 36207323 did assess a broad number of HPV types, but these were classified into three non-cladistic categories, HPV16, HPV18 and Other for comparison. More in line with the main point of that study, HPV18 was enriched, though not significantly, in the more pathogenic C2 group (which was defined by a deep analysis of specific genomic alterations). It can be speculated that perhaps α9 types are less proficient at effecting or interacting with some C2 characteristic(s). Overall, we suggest that these observations emphasize the importance of examining the full spectrum of HPV types including phylogenetic relationships in cervical cancers induced by these viruses.

      Reviewer #1:

      The detection of “non-tumor HPVs” was noted as a potential limitation. The highly multiplexed, HC+SEQ methodology that we use obviously detects many HPV types and thus can identify lesions with multiple HPV types as occurred in Patient 16 and in other HPV cancers. It is unclear what role multiple HPV types might play in tumorigenesis if any. Regardless of whether broad detection of HPV types proves to be a limitation or an advantage, it will be interesting. Our approach in this study focused on integration of HPV DNAs into human DNA, as this is a key event in cervical tumorigenesis. We believe that detection of clonally expanded cells with an integrated URR-E6-E7 DNA segment of any HPV type (whether high-risk, low-risk, or intermediate, or even perhaps non α-clade {PMID:40742260}) should be viewed with suspicion. For the small fraction of cervical cancers that contain only unintegrated HPV DNA, it will be interesting to see if these viral DNAs share any particular properties.

      The reviewer asked for details of the HPV DNA capture probes used. All were from the proprietary Roche Nimblegen SeqCap EZ System. They encompassed all HPV types from HPV1 through HPV225.

      The reviewer questioned why the data verifying the viral-human DNA junctions in primary tumor tissue by the orthogonal approach of PCR assays were not shown. The data summary and the approach used for PCR are in Figure 1, Table 1 and Supplementary Table 1. Only the dozens of agarose gel photographs were not shown. PCR assays that addressed key issues comparing primary and metastatic sites and confirming HPV16 + HPV18 coinfection are shown in Figure 2 and Figures 4A & 4B, respectively.

      Reviewer #2:

      The reviewer raised general issues about data quantification and statistical adequacy. Regarding data quantification, we used a strict, conservative guideline of a 10 read minimum per junction in the DNA from tumor samples. This was based on the sequence analysis pipeline design and on our requirement that some clonal expansion of cells containing specific junctions must have occurred. Extensive complications to comparing quantified read counts in different studies are detailed below in the responses to specific comments. The statistical methods used were based on the dichotomous variable of detection versus no detection of integrated HPV DNA. For this study, we also used the orthogonal method of verifying every junction by PCR with one primer in viral DNA and the other in flanking human DNA followed by Sanger sequencing. The statistical methods used were entirely appropriate for this dichotomous variable and time to event analyses. Nonetheless, we concur that quantification of HPV DNA integration would be an interesting variable to consider once carefully controlled methodologies are applied considering the issues detailed below.

      Regarding the first point about variability in HPV-human junction number in different studies: The number of HPV DNA genome and junction read counts obtained from a sample are subject to numerous technical and biological variables. Extensive caution should be applied when comparing quantitative results among different studies, and this particularly includes the number of HPV-human DNA junctions detected. Among the factors that can be involved among different studies are the following: 1) inadequate deduplication of sequence reads; 2) “barcode-hopping” or “bleed-through” from one sample to another and thus cross-contamination of one sample with another during multiplexed short-read sequencing; 3) variation in the fraction of cells that are tumor cells in the post-clinical analysis sample of tissue obtained; 4) artifactual ligation of HPV and human DNA segments occurring at the adaptor ligation step of short-read sequencing; 5) variability in the mismatch settings of computational sequence aligners used; 6) perhaps most importantly, the level of genomic instability of each particular integration locus; and 7) subclonal variation in proliferation or survival of cells containing specific junctions within a lesion. The reviewer correctly implied that our requirement for a minimum of 10 sequence reads at each junction excludes low level, subclonal variants. Nonetheless, one tumor did have two integrations (Table 1). More importantly, we emphasize that all five tumor-recurrences at distant metastatic sites in our study had the exact same integration event as the primary tumor (determined at single nucleotide resolution at both ends). We judge this to be compelling evidence that the approach we use correctly identifies the key integration event underlying each cancer.

      Regarding the second point about ratios between genomic DNA copy numbers and junction read counts: Both human genome and HPV genome copy numbers deserve mention in regard to this issue. HPV HC+SEQ highly enriches for viral DNA, with the advantage gained of high read depth for viral sequences, but with human DNA largely excluded (except for the junction reads). Thus, ratios of junctions to the rest of the human genome cannot be assessed as they can be with whole genome sequencing methodologies. While HPV genome read depth can be ascertained with HC+SEQ reads (as in Figure 1C, 1D, 1E), and the reviewer’s suggestion raises the possibility of using junction to viral read ratios to normalize data to compare different integration loci and even perhaps different studies, there are nonetheless additional, biomedically relevant complications. HPV DNA segments are sometimes present as tandem units with or without human DNA segments in tumors (Figure 1E shows the former), and this affects the ratio of junctions to viral genomes. Thus, using the suggested ratios would require additional normalization for tandem copy numbers, and thus, it would be difficult to use them in a manner analogous to gene-specific read counts per million total read ratios in RNA-seq.

      Regarding the third point about comparing read counts from primary tumor tissue with those from cfDNA: Ours was a retrospective study using archived samples that were available, and the HPV genome coverage obtained by HC+SEQ using cfDNA varied (Table 1). Assessment of viral DNA genome and human junction reads in a quantitatively reliable manner by HC+SEQ will require application of precise collection, storage, and processing of cfDNA samples. Nonetheless, the results presented in this study, while variable among the different samples, were entirely sufficient to test the dichotomous variable analyzed. We note that this included orthogonal, PCR verification of junctions, based on the straightforward, abundant identification of the junctions by HC+SEQ in the primary tumor samples. We emphasize that examination of HPV DNA integration directly interrogates a key, likely causal event in HPV cervical tumorigenesis.

      Regarding the fourth point about many of the initial cancer samples harboring no junction breakpoints: 100% of the 16 initial, cervical, primary tumor tissue samples harbored an integration (one sample had two). The reviewer is correct that many of the initial cfDNA samples lacked HPV DNA integration as assessed by HC+SEQ and by PCR based on the junctions detected in the primary tumor tissue. We interpret this to mean that these cancers were not spilling genomic DNA containing the integrated HPV DNA into serum at sufficient levels to be detected, and judge this to be due to underlying, unidentified, biomedically-relevant effects.

      Regarding the fifth point about HPV-human DNA junctions being used as a measure of tumor heterogeneity and subclonal variation: We concur with the reviewer that this is an interesting, important issue. We noted it in the response to the “first” point (numbers 6 and 7) above. Again, one of the samples had two integrations, and this patient did not suffer a recurrence (Table 1, Figure 1). Based on our ongoing experience, to take findings of junction subclonality beyond just detection of multiple integration junctions, we believe that development of in situ, single cell approaches is necessary to reveal the full meaningful picture of subclonality.

      Beyond these quantitative issues that we raise in response to Reviewer #2’s comments, the Reviewers’ comments point at important, incompletely understood aspects of HPV tumorigenesis. Our finding of identical viral DNA insertions in primary tumors and metastases points to a central, constant role for these structures in viral tumorigenesis. Nonetheless, the issues raised point to key questions concerning subclonality, detailed structures and quantification of HPV and human tandem DNA units, intrachromosomal DNA vs. ecDNA, genomic instability of integrated HPV DNA loci, and cell-to-cell variation, and what roles these might play in tumorigenesis.

      Regarding the point about cell-free DNA breakpoints, we note the field of circulating tumor DNA fragmentomics that examines the sequences and a host of structural properties of circulating DNAs derived from tumors including specific, short sequences at the ends (breakpoints) of DNA fragments circulating in blood. These are of emerging significance as biomarkers for cancer {PMIDs:40038442 and 41043439}. We note that cell free DNA breakpoints are not synonymous with DNA junctions. We stress again that the main point of our manuscript was to investigate HPV-human DNA junctions in cfDNA, as this directly addresses a likely causal mechanism underlying HPV cervical tumorigenesis. Additional, future studies would be required to assess the effectiveness of our targeted, individualized approach relative to other aspects of fragmentomics in cervical cancer.

      In summary, we restate one of the reviewers’ points. “This study provides important foundational evidence for further evaluating the clinical utility of HPV DNA detection from cfDNA and specifically assessing for integration junctions.” Both reviewers raised thoughtful points about DNA integration and HPV tumorigenesis, and prospective studies are required to refine and evaluate clinical utility of the new findings presented here.

    1. eLife Assessment

      This important study probes the long-standing failure to resolve evolutionary relationships between the classical "spiralian" taxa (i.e., annelids, molluscs, brachiopods, platyhelminths and nemerteans) and provides convincing evidence that the branches leading to them are so short as to be unreliable guides to their relationships. This, in turn, has wide-ranging implications for our understanding of animal body plan evolution and the interpretation of early animal fossils.

    2. Reviewer #1 (Public review):

      [Editors' note: this version has been assessed by the Reviewing Editor without further input from the original reviewers. The revised version adequately addresses the relatively minor comments from the previous round of review.]

      Summary:

      This interesting paper probes the problematic relationships between the classical "spiralian" taxa, i.e., annelids, molluscs, brachiopods, platyhelminths and nemerteans, and shows that the branches leading to them are so short as to be unreliable guides to their relationships. This, in turn, has important implications for how we view the origin of the animal phyla.

      Strengths:

      A very careful analysis of a famous old problem with quite significant results. The results seem to be robust and support their conclusions.

      It often passes uncommented that many different trees are published about animal relationships, yet some parts of the tree seem extremely difficult to resolve; the spiralians are perhaps the most difficult case. More recently, problems about sponges or ctenophores as sister groups to the rest of the animals have alerted us to major areas of uncertainty in large-scale phylogenetic reconstruction; this paper is a welcome reminder that other, perhaps even harder, problems exist which may be difficult to ever resolve with the (molecular) data we have.

    3. Reviewer #2 (Public review):

      Summary:

      The relationships among the phyla making up Spiralia - a major clade of animals including molluscs, annelids, flatworms, nemerteans and brachiopods - have been challenging from a phylogenomic perspective despite decades of molecular phylogenetic effort. Every topology uniting subsets of these phyla has been recovered with apparent support in at least one study, yet no consensus has emerged even from large-scale genomic datasets. Serra Silva and Telford set out to determine whether this instability reflects a genuine biological signal being obscured by analytical limitations, or whether it reflects a rapid, near-simultaneous origin of these phyla that has left behind in modern genomes far too little phylogenetic information to resolve. They focused deliberately on five phyla, reducing the problem to a tractable set of 15 unrooted and 105 rooted topologies, and applied a suite of complementary approaches across two independent datasets and multiple substitution models to test whether any topology is significantly preferred over alternatives.
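
      The counts quoted above follow from the standard formulas for the number of fully resolved (binary) trees on n labelled tips: (2n-5)!! unrooted trees and (2n-3)!! rooted trees. The following is a minimal reader's sketch of that arithmetic, not the authors' code; the function names are illustrative only.

```python
def odd_double_factorial(k: int) -> int:
    """Return 1 * 3 * 5 * ... * k for odd k (returns 1 when k < 3)."""
    result = 1
    for i in range(3, k + 1, 2):
        result *= i
    return result


def n_unrooted_topologies(n_tips: int) -> int:
    """Number of unrooted binary trees on n_tips labelled tips: (2n - 5)!!."""
    if n_tips < 3:
        raise ValueError("need at least 3 tips for an unrooted binary tree")
    return odd_double_factorial(2 * n_tips - 5)


def n_rooted_topologies(n_tips: int) -> int:
    """Number of rooted binary trees on n_tips labelled tips: (2n - 3)!!."""
    if n_tips < 2:
        raise ValueError("need at least 2 tips for a rooted binary tree")
    return odd_double_factorial(2 * n_tips - 3)


if __name__ == "__main__":
    # Five focal phyla, then the effect of adding further phyla.
    for n in (5, 6, 7):
        print(n, n_unrooted_topologies(n), n_rooted_topologies(n))
```

      With five tips this gives 15 unrooted and 105 rooted topologies; a sixth phylum already raises the counts to 105 and 945, which illustrates why exhaustive topology scoring becomes costly as phyla are added.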

      Strengths:

      (1) The conceptual framing of the problem is excellent, and the study makes a convincing case across several lines of evidence. By enumerating all possible topologies and demonstrating empirically that every one of the 15 unrooted arrangements has been recovered as the preferred solution in at least one published study, the authors make a strong argument about the state of the field. The use of two entirely independent datasets as a consistency check is great, and convergence between them, where it occurs, substantially strengthens confidence in the conclusions.

      (2) It is my view that the simulation framework is a particular strength. Generating data on a fully unresolved star tree and scoring those data under both correctly-specified and misspecified substitution models provides convincing evidence that the strong preference for rooting Spiralia on the flatworm branch is, at least partly, an analytical artefact driven by the exceptionally long branch in combination with compositional heterogeneity across sites. This is an important methodological demonstration with implications beyond spiralian phylogenetics, as the same issue is likely to affect other deep, long-branched lineages in the animal tree of life.

      (3) The randomised taxon-jackknifing approach is a very nice addition here. The demonstration that preferred topologies shift depending on which species happen to be sampled (even within the same phylum) is a convincing indicator of weak signal, and provides a practical caution for future studies that may report strong support for a particular spiralian arrangement based on a fixed taxon sample.

      (4) The branch-length analyses, benchmarking internal interphylum branches against the already disputed and extremely short branch uniting deuterostomes (work also by this group), are well-conceived and solid.

      (5) I think it is worth highlighting the notable intellectual honesty throughout the paper: the authors do not overstate their results, correctly acknowledging that while the unrooted topology grouping molluscs with brachiopods and flatworms with nemerteans emerges most consistently, this preference is not statistically significant under more adequate substitution models and may itself carry some artefactual component.

    4. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      This interesting paper probes the problematic relationships between the classical "spiralian" taxa, i.e., annelids, molluscs, brachiopods, platyhelminths and nemerteans, and shows that the branches leading to them are so short as to be unreliable guides to their relationships. This, in turn, has important implications for how we view the origin of the animal phyla.

      Strengths:

      A very careful analysis of a famous old problem with quite significant results. The results seem to be robust and support their conclusions.

      It often passes uncommented that many different trees are published about animal relationships, yet some parts of the tree seem extremely difficult to resolve; the spiralians are perhaps the most difficult case. More recently, problems about sponges or ctenophores as sister groups to the rest of the animals have alerted us to major areas of uncertainty in large-scale phylogenetic reconstruction; this paper is a welcome reminder that other, perhaps even harder, problems exist which may be difficult to ever resolve with the (molecular) data we have.

      Weaknesses:

      The paper could have perhaps drawn out some of the implications of its results in a clearer manner.

      Reviewer #2 (Public review):

      Summary:

      The relationships among the phyla making up Spiralia - a major clade of animals including molluscs, annelids, flatworms, nemerteans and brachiopods - have been challenging from a phylogenomic perspective despite decades of molecular phylogenetic effort. Every topology uniting subsets of these phyla has been recovered with apparent support in at least one study, yet no consensus has emerged even from large-scale genomic datasets. Serra Silva and Telford set out to determine whether this instability reflects a genuine biological signal being obscured by analytical limitations, or whether it reflects a rapid, near-simultaneous origin of these phyla that has left behind in modern genomes far too little phylogenetic information to resolve. They focused deliberately on five phyla, reducing the problem to a tractable set of 15 unrooted and 105 rooted topologies, and applied a suite of complementary approaches across two independent datasets and multiple substitution models to test whether any topology is significantly preferred over alternatives.

      Strengths:

      (1) The conceptual framing of the problem is excellent, and the study makes a convincing case across several lines of evidence. By enumerating all possible topologies and demonstrating empirically that every one of the 15 unrooted arrangements has been recovered as the preferred solution in at least one published study, the authors make a strong argument about the state of the field. The use of two entirely independent datasets as a consistency check is great, and convergence between them, where it occurs, substantially strengthens confidence in the conclusions.

      (2) It is my view that the simulation framework is a particular strength. Generating data on a fully unresolved star tree and scoring those data under both correctly-specified and misspecified substitution models provides convincing evidence that the strong preference for rooting Spiralia on the flatworm branch is, at least partly, an analytical artefact driven by the exceptionally long branch in combination with compositional heterogeneity across sites. This is an important methodological demonstration with implications beyond spiralian phylogenetics, as the same issue is likely to affect other deep, long-branched lineages in the animal tree of life.

      (3) The randomised taxon-jackknifing approach is a very nice addition here. The demonstration that preferred topologies shift depending on which species happen to be sampled (even within the same phylum) is a convincing indicator of weak signal, and provides a practical caution for future studies that may report strong support for a particular spiralian arrangement based on a fixed taxon sample.

      (4) The branch-length analyses, benchmarking internal interphylum branches against the already disputed and extremely short branch uniting deuterostomes (work also by this group), are well-conceived and solid.

      (5) I think it is worth highlighting the notable intellectual honesty throughout the paper: the authors do not overstate their results, correctly acknowledging that while the unrooted topology grouping molluscs with brachiopods and flatworms with nemerteans emerges most consistently, this preference is not statistically significant under more adequate substitution models and may itself carry some artefactual component.

      Weaknesses:

      (1) The restriction to five phyla is the most significant limitation. The authors acknowledge this and give a clear computational justification, but readers should be aware that the paper's convincing conclusions apply specifically to the five focal phyla and the evidence remains incomplete with respect to spiralian phylogeny as a whole.

      (2) The treatment of substitution model adequacy, while commendably thorough for site-heterogeneous models, is necessarily bounded. The authors note that models accounting for non-stationarity, across-lineage compositional heterogeneity, or mixtures of tree histories might yield different results, and that even the most sophisticated currently available approaches have not produced consistent spiralian topologies across studies. This is not a criticism of what has been done here - the analytical scope is reasonable and well-implemented - but it means the paper cannot be read as a definitive demonstration that no model will ever resolve these relationships. The distinction between a true hard polytomy and a radiation that is effectively unresolvable given current data and methods could be drawn more sharply in the discussion.

      (3) The reticulation-aware coalescent analyses are presented somewhat briefly relative to the likelihood-based topology scoring. The finding that flatworms are recovered within a paraphyletic jaw-bearing animal clade in both summary trees - interpreted as long-branch attraction - is striking, and its implications for gene-tree-based approaches to spiralian rooting deserve more discussion than they currently receive.

      (4) The central conclusions - that interphylum branches in Spiralia are extraordinarily short, that topological preferences are strongly model-dependent and taxon-sampling-sensitive, and that an ancient rapid radiation is the most parsimonious explanation - are convincingly supported by the evidence presented. The identification of flatworm long-branch attraction as an important confounding factor in rooting analyses is itself an important and well-demonstrated result.

      Conclusion:

      This paper clearly makes an important contribution to the ongoing debate about spiralian relationships and, more broadly, to methodological discussions about how to handle anciently diversified clades where phylogenetic signal is genuinely limited. The exhaustive topology-scoring framework combined with taxon-jackknifing and simulation under unresolved trees is a valuable methodological template that could usefully be applied to other notoriously difficult nodes in the animal tree. I thoroughly enjoyed the discussion of the implications of these findings for interpreting Cambrian fossils and the evolutionary history of shells, segmentation, larval types and other characters - it is both thoughtful and thought-provoking and will be of broad interest well beyond the phylogenomics and zoology communities. From a very practical perspective, the data and scripts provided make the work useful to researchers wishing to apply similar approaches to other groups.

      Reviewer #3 (Public review):

      Summary:

      This paper addresses the controversial internal relationships within the Spiralia, a major clade of invertebrate animals including molluscs, annelids, brachiopods and flatworms.

      Strengths:

      Performs a range of empirical analyses and simulations that address the core question. Although a favoured unrooted topology finds some support, this is not strongly endorsed in the paper.

      Weaknesses:

      (1) Only considers a subset of relevant phyla (e.g. gastrotrichs are relevant to the phylogenetic position of Platyhelminthes), although how this would change the scale of the analyses (i.e. number of topologies) is addressed in the paper.

      (2) Discussion of Spiralia evolution and broader context, particularly the relevance for the fossil record. Line 448: our current understanding of the early spiralian fossil record is quite consistent with the main results of this paper. For example, there are very few claims for fossils that sit on the short branch leading to Spiralia (or Lophotrochozoa as defined here) that this paper discusses. Many of the key fossils that inform on the characters discussed in the introduction, which have unusual character combinations, have an apomorphy of one of the phyla discussed, and so are resolved as members of the stem lineages of particular phyla.

      (3) This is what you would expect with long phylum stem lineages (line 148) and a short Spiralia stem lineage. For example, the mollusc Wiwaxia has chaetae but a mollusc-like radula (Smith 2012), and the conchiferan mollusc Pelagiella has chaetae and a coiled shell (Thomas et al. 2020). The only fossil groups that are routinely discussed as belonging to the stem lineage of more than one phylum are the tommotiids, which have chaetae, segmentation and a complex mineralised skeleton (but not shells in the brachiopod/mollusc sense, see Guo et al 2023), but they sit on the lophophorate stem lineage, a synapomorphy-rich group the monophyly of which the present paper endorses (e.g. line 435). The fossil record is consistent with the scenario presented in line 442, e.g. convergent loss or reduction of chaetae and segmentation and convergent evolution of shells in molluscs and brachiopods.

      We thank the reviewers for their kind comments. Please see below for detailed responses to all identified weaknesses.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      Some minor comments that might help improve the paper:

      (1) Abstract L17. "Most analyses on the 15 unrooted trees showed a preference for the same topology but the support over other solutions was non significant" - I don't really understand this sentence in the context of the paper; it makes it sound as if the tree is, after all, well resolved! Also, would "non-significant" or "not significant" be better than "non significant"?

      Having read the rest of the paper I see what this refers to (uT4), but still I don't understand the second clause.

      Re-written to clarify.

      (2) Introduction L31. This makes it sound as if phoronids are actually part of brachiopods, and while that was recovered by Cohen and Weydmann 2005, I'm not sure if it's really a general result. In addition, rather than using "brachiopods plus phoronids" everywhere, you could use "Brachiozoa" (Cavalier-Smith 1998, Biol. Rev).

      We have updated our text and figures to use Brachiozoa.

      (3) L36-37. Yes, but the presence of Chaetognatha in this clade is suggestive that their primitive body size is not small.

      Have made clear that chaetognaths are not all tiny.

      (4) L85. Kumar et al. may have claimed that Spiralia are as old as 670, but many other analyses would suggest a range of different results. Why choose just this one? In addition, this age seems rather incompatible with your results.

      We agree this maximum age is highly improbable (the principal point remains the deep age of the protostomes). We have used a different reference and refer to a generally acceptable minimum age only.

      (5) L88. The key part of this sentence, "proving a hard polytomy", comes at the end of a long set of references that makes it hard to connect to the lead-in "given the age of", so I would suggest rephrasing.

      Rephrased for clarity.

      (6) L109. It is unclear what this means in the context: "and even support multiple topologies".

      Re-worded for clarity.

      (7) Figure 1. Why did you choose to indicate brachiopods plus phoronids as a larval form, unlike the other clades? Perhaps it's because we don't know what the last common ancestor of the two looked like (unless P is an ingroup of B), but that's arguably true for some of the other clades as well!

      Apologies, this was laziness as we already had a line drawing of an actinotroch larva. Have improved the images in figures 1 and 5 where required.

      (8) L164. Reticulation-aware analyses. As I understand it, this would include introgression, hybridization, etc. However, incomplete lineage sorting has also been invoked, not just for Cambrian-explosion age events but also for other major radiations, such as for angiosperms and birds. How significant might ILS be for generating the results you get?

      Section title amended. Results section updated to reflect this. We now explicitly mention the potential impact of ILS and introgression on spiralian relationships in our discussion.

      Unrooted trees analysis:

      (9) L405 on. Maybe it would be worth including a figure showing the relative branch lengths of uT4. All the images of trees show similar-length branches, which gives off the wrong impression within the context of the paper!

      We understand the motivation, but we worry that showing uT4 as the sole phylogram may end up with this being interpreted by a casual reader as being the main result of the paper. Hopefully the figures with branch lengths encompass this information well enough and with no danger of misinterpretation.

      (10) L430 on. Why is this a "conservative" interpretation?

      Yes agreed not clear. Have changed to “We interpret our results as showing that…”

      (11) You mention synapomorphy accumulation time and implicitly equate shortness of branches with shortness of time. However, other options are available under varying diversification rate models (e.g. ClaDS, Barido-Sottani et al. 2023, Syst. Biol.; CET, Budd and Mann 2025, Syst. Biol.). In particular, the latter paper shows that when unusually large clades are selected for study (as is arguably the case here), then those clades are likely to have started with very high "evolutionary tempo", which speeds up all aspects of evolution, including diversification rates.

      In the Budd and Mann scenario large clades begin with high tempo of cladogenesis, high substitution rate and high diversification rate (rapid origin of new characters). This would suggest that the period of the radiation was extra rapid (even less time than in a ‘normal’ period during which smaller clades emerge) so we feel the point stands.

      (12) L449. Maybe refer to the Song et al. paper again here on scaphopods plus bivalves, as it makes the same sort of points, albeit in a slightly different context.

      We thank the reviewer for the suggestion and have added the citation where relevant.

      (13) Finally, to return to L20. You mention implications for the Cambrian fossil record, but then fail to deliver any!

      We have hopefully addressed this remark in the discussion better (at least to the extent we are qualified to).

      Yet if you are correct, then synapomorphy accumulation would unite groups of phyla, and would surely lead to a scenario highly incompatible with clock models suggesting deep origins of clades (as they would all be more fossilisable).

      Apologies but we don’t completely understand this point as ‘synapomorphy accumulation would unite groups of phyla’ is a little ambiguous. Of course, this is generally true, but our results suggest there was little opportunity to accumulate identifiable synapomorphies linking pairs, triplets or quartets of our 5 spiralian phyla.

      In addition, clock results suggest rather long periods of time leading to the phyla, which would imply that there would have to be extremely slow rates of molecular evolution to yield the short early branches here. Also, it might be worth referring to papers compatible with this view, such as Wernström, J.V. et al., EvoDevo 13, 17 (2022). https://doi.org/10.1186/s13227-022-00202-8 or some of the palaeo literature, such as Budd and Jackson 2016, Phil Trans.

      The referee refers to clock results suggesting a (deep) Ediacaran origin of Lophotrochozoa/Spiralia. We interpret the spiralian radiation itself as rapid but, in the absence of a clock analysis, we cannot comment on when it took place.

      Reviewer #2 (Recommendations for the authors):

      (My not very) Major points - as I feel this is an excellent paper.

      (1) The coalescent-based summary tree analyses warrant expansion. The recovery of flatworms within a paraphyletic jaw-bearing animal clade in both summary trees is a striking result attributed to long-branch attraction, but this interpretation would be strengthened by examining whether pruning or downweighting the longest-branching taxa within those groups affects the outcome, or by reporting per-node quartet scores more fully. This would make the reticulation-aware results more directly informative and would bring this section into better balance with the detailed likelihood-based analyses.

      We thank the reviewer for the suggestion of the expanded analyses. We have now done these, and they yielded essentially the same results as the unpruned analyses. Additionally, while not discussed, we ran the Astral analyses on the subset of gene-trees where all groups of interest (spiralian phyla and superphyletic Ecdysozoa, Deuterostomia, etc.) were monophyletic and found no changes to interphylum quartet scores beyond those due to enforced (super)phylum monophyly, with Platyhelminths still recovered within Gnathifera.

      We have expanded our description of the results slightly as well as our discussion. Location of the tables with detailed quartet scores and local posterior probabilities has been added to Fig. S1’s legend.

      (2) It would strengthen the paper to include at least a brief analysis or explicit discussion of whether any currently available models accounting for non-stationary or across-lineage compositional heterogeneity show any change in the pattern of support, even if only tested on a subset of topologies. A null result here would itself be informative and would make the conclusions more robust to the concern that unexamined model classes might behave differently.

      We thank the reviewer for the suggestion, but this represents a considerable amount of new work and we think it falls outside the scope of the present work. We have, as suggested, included this as a discussion point.

      (3) The authors note that topologies grouping flatworms with ribbon worms appear among the higher-scoring arrangements even under model misspecification in simulations. It would be helpful to comment explicitly on whether the apparent signal for this grouping should therefore be regarded with particular scepticism, or whether it survives artefact correction in any of the analyses, as this is a grouping that has appeared repeatedly in the literature and readers will want guidance on how to interpret it.

      We do state that the nemertean+platyhelminth grouping seems likely to be at the least emphasised by an artefact (as the referee points out it is common to the higher scoring trees in the star tree simulations). We state that this suggests “…that this grouping derives some support from systematic errors.” We now return briefly to this in the discussion.

      Writing and presentation

      (1) The abstract states that rooting Spiralia on the flatworm branch "is a long-branch artefact" - this is slightly stronger than the language used in the body of the paper, where the authors correctly write that this preference is "at least enhanced by" the artefact. The abstract phrasing should be softened to reflect the more nuanced conclusion in the text.

      Good point. Done.

      (2) A brief signposting sentence near the start of the Results, setting out the overall analytical logic before the individual sections begin, would help orient readers. The strategy - score all topologies, test robustness to model choice and taxon sampling, then use simulation to identify artefactual signals - is clear in retrospect but would benefit from being made explicit upfront.

      We have taken this suggestion on board. The summary seemed in the end better placed as the final part of the introduction.

      (3) Figure 3 is complex and would be easier to interpret with a brief explanatory note in the legend clarifying what a wide versus narrow range of log-likelihood scores across topologies means in practical terms for statistical resolution between trees.

      Added sentence to legend.

      Minor Corrections:

      (1) The Figure 2 legend contains a typographical error: "shorter than the short, disputed deuterostome branch" should read "shorter than."

      Done

      (2) At least one reference appears to carry a future publication year (Ishii et al., 2026) and should be verified for accuracy before final submission.

      This reference is correct per the journal’s website. We did find Google Scholar to list it as being from 2025.

      Reviewer #3 (Recommendations for the authors):

      (1) Abstract/SI definitions of Spiralia/Lophotrochozoa

      While I don't have strong feelings about this, if Spiralia is being used as an apomorphy-based name, then it still might be equivalent to Lophotrochozoa, as spiral cleavage in Gnathostomula jenneri was illustrated by Riedl (1969). Although no other studies have replicated this observation, this should at least be mentioned.

      Sorry this reference to gnathostomulid spiral cleavage was included in a longer version of the discussion of nomenclature. This was first reduced in length (which was when the mention of gnathostomulid spiral cleavage was dropped) then finally moved to the supplementary material. We have now re-included mention of this in the discussion in supplementary info.

      The SI text suggests that the name Lophotrochozoa, as used in its original form by Halanych et al. (1995), was a node-based definition, and that this name is for the sister group of Ecdysozoa. However, in that paper, the name is actually defined as "as the last common ancestor of the three traditional lophophorate taxa, the molluscs, and the annelids, and all of the descendants of that common ancestor". This definition would exclude Gnathifera, and depending on the internal relationships of the non-Gnathiferan phyla, may be equivalent (or not) to the usage of the name Spiralia adopted in the present paper. The perils of mixing node and apomorphy-based definitions of clades are clear, and the situation is less straightforward than the paper suggests, and (somewhat unhelpfully given the subject of the paper) may only become clearer if the relationships of non-ecdysozoan protostomes are resolved.

      We believe that the community universally understood the definition of Lophotrochozoa following the 1997 paper (by the authors who also provided the original 1995 definition). This 1997 definition included both chaetognaths and rotifers as examples of the Gnathifera. The Spiralia, in contrast, began life not even as a name for a clade but a description of a character shared by some apparently unrelated taxa – similar to a grouping of ‘carnivores’. The introduction of a new name was, we suggest, unhelpful. We hope that by defining our terms up front the meaning in the current paper is clear.

      (2) Introduction

      Line 76. Some references needed regarding claims that there was a polymeric brachiopod ancestor, e.g. Gutman (1978), Temereva and Malakhov (2011), Guo et al. (2023). Likewise for the chaetae of brachiopods, annelids and molluscs, e.g. Schiemann (2017), as it's key to trace where these ideas originated.

      Added

      Figure 1. This is a nice illustration of the uncertainty in the relationships of these groups. However, I kept checking which thumbnail image was which for nemerteans and annelids. A minor suggestion, but perhaps a polychaete instead for the annelid?

      We have replaced the rather poor image of an earthworm with a polychaete and also now include labels. We hope the improved images are more helpful. Good point.

      (3) Results

      Branch length comparison. I understand why the deuterostome stem was chosen as the branch for comparison from the point of view of phylogenetic uncertainty. However, what about the branch leading to Ecdysozoa or the branch subtending Lophotrochozoa and/or Gnathifera? Given that the short internodes are used as an argument underpinning uncertain relationships, can we be sure that Gnathifera is not nested within the group of interest, especially given that Gnathifera contains many long-branched taxa and the root may be misplaced within the group?

      We have added the Lophotrochozoa and Ecdysozoa median lengths to our plots and now discuss both the lophotrochozoan branch in our results.

      Line 249. Given that Spiralia is the group of interest, why were the Gnathiferans also chosen at random?

      The point of the experiment was to see the effect of taxon sampling on the consistency of the resulting topology. Random sampling across the tree seems helpful in this context. We chose Gnathifera as one group to sample from as this ensured they would be present in all trees. This seems appropriate as they are the sister group of the clade of interest and as such their inclusion reflects a choice a typical investigator might make when choosing which species to include. Additionally, as noted in the reviewer’s earlier comment, Gnathifera includes many long-branched taxa and we wanted to ensure our root-placement results were robust to this aspect of taxon sampling.

      (4) Discussion

      Line 448. Our current understanding of the early spiralian fossil record is quite consistent with the main results of this paper. For example, there are very few claims for fossils that sit on the short branch leading to Spiralia (or Lophotrochozoa as defined here) that this paper discusses. Many of the key fossils that inform on the characters discussed in the introduction that have unusual character combinations have an apomorphy of one of the phyla discussed, and so are resolved as members of the stem lineages of particular phyla.

      This is what you would expect with long phylum stem lineages (line 148) and a short Spiralia stem lineage. For example, the mollusc Wiwaxia has chaetae but a mollusc-like radula (Smith 2012); the conchiferan mollusc Pelagiella has chaetae and a coiled shell (Thomas et al. 2020). The only fossil groups that are routinely discussed as belonging to the stem lineage of more than one phylum are the tommotiids, which have chaetae, segmentation and a complex mineralised skeleton (but not shells in the brachiopod/mollusc sense, see Guo et al. 2023), but they sit on the lophophorate stem lineage, a synapomorphy-rich group the monophyly of which the present paper endorses (e.g. line 435). The fossil record is consistent with the scenario presented in line 442, e.g. convergent loss or reduction of chaetae and segmentation and convergent evolution of shells in molluscs and brachiopods.

      We accept these points (though are clearly not experts on these fossils). We have (slightly tentatively given our lack of expertise) expanded our discussion to include these fossil taxa with their combinations of characters.

    1. eLife Assessment

      This study presents a useful database resource containing protein conformations generated through molecular dynamics simulations, with extensive quality evaluation and benchmarking. While the database is well-constructed and professionally organized, the evidence supporting its claimed representation of protein conformational landscapes is incomplete, as the short simulation times and starting structure bias prevent true Boltzmann sampling of the conformational space.

    2. Reviewer #1 (Public review):

      Summary:

      The authors describe a new database that rigorously explores protein conformations.

      Strengths:

      It is extremely well done, using state-of-the-art tools by a group at the top of the field of structural modeling. The evaluation of qualities and the benchmarking of the structures are outstanding, and it is expected that the new database will have a significant impact on the field.

      Weaknesses:

      The authors are using MD simulation to generate some of the structures, and therefore should have access to standard MD energies. I am surprised that no evaluation is provided based on these energies, which can be extended to free energies.

    3. Reviewer #2 (Public review):

      Summary:

      The authors developed a dataset of protein conformations by running molecular dynamics simulations starting from both native and decoy conformations for a large number of proteins. These conformations were put together as a dataset for querying and downloading, along with their energies under different force fields. The authors suggest that such conformations represent the proteins' conformational landscape, so that they will be useful for evaluating methods generating multiple conformations of proteins.

      Strengths:

      The dataset is online and working. It has good documentation for others to use.

      Weaknesses:

      The biggest weakness is that the collected conformations very likely do not represent the true conformational landscape. To represent the conformational landscape, the structures need to be sampled based on the Boltzmann distribution. However, in this study, conformations are generated by running very short (125 ps to 375 ps) MD simulations starting from near-native conformations and decoys. Such short simulations will produce small fluctuations around the starting conformations, so the distribution of conformations is largely dominated by the distribution of the initial conformations, which by no means are Boltzmann distributed. A conformation might be physically plausible, but it might have very small weight in the Boltzmann distribution. On the other hand, conformations with large weights might not be in the dataset.
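
      For reference, the Boltzmann weighting invoked here can be written in its standard textbook form. The symbols below (E_i for the energy of conformation i, k_B for the Boltzmann constant, T for temperature, Z for the partition function) are introduced only for illustration and are not taken from the manuscript under review; the 3 kcal/mol gap is a made-up example value.

      ```latex
      % Boltzmann probability of conformation i at temperature T
      p_i = \frac{e^{-E_i/(k_B T)}}{Z},
      \qquad
      Z = \sum_j e^{-E_j/(k_B T)}

      % Relative weight of two conformations depends only on their energy gap
      \frac{p_i}{p_j} = e^{-(E_i - E_j)/(k_B T)}

      % Illustrative numbers: T = 300 K gives k_B T \approx 0.596 kcal/mol,
      % so a conformation 3 kcal/mol above another has relative weight
      \frac{p_i}{p_j} = e^{-3/0.596} \approx 0.0065
      ```

      Under these assumed numbers, a physically plausible conformation only 3 kcal/mol above the minimum would appear in well under 1% of Boltzmann-distributed samples, which is why a set of conformations chosen by starting structure rather than by energy need not reflect the landscape.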

    4. Reviewer #3 (Public review):

      Summary:

      This manuscript describes a web-based tool that allows researchers to compare large numbers of representative ("plausible") conformations of proteins. It also includes energetic analysis from multiple widely used structure-prediction methods.

      Strengths:

      This tool will likely be useful for students who want to learn more about the ensemble properties of proteins. The resource is well organized and it represents a large amount of computing resources.

      Weaknesses:

      It is not entirely clear how the database may be utilized by other groups to advance research. It could be helpful if the authors add a short section that provides example use cases that illustrate how this database can support new strategies for studying protein dynamics.

    1. eLife Assessment

      This is an important study uncovering a new role of the SETD6-PPARγ axis in the regulation of hepatic lipid metabolism. The data convincingly demonstrate that methylation of PPARγ by SETD6 plays a key role in this process, linking lysine methylation to transcriptional control of lipid storage genes.

    2. Reviewer #1 (Public review):

      Summary:

      In this manuscript from the Levy lab, the authors investigate whether SETD6 regulates hepatic lipid accumulation through direct methylation of PPARγ. They show that SETD6 binds and mono-methylates PPARγ at K170, and provide evidence that this modification enhances PPARγ occupancy at target promoters, promotes expression of lipid metabolism genes, as well as facilitates lipid droplet accumulation in HepG2 cells. The authors also find a positive feedback loop or circuit in which PPARγ activates SETD6 transcription in a methylation-dependent manner, thereby reinforcing this lipogenic program. Overall, the work presents a novel SETD6-PPARγ regulatory axis linking lysine methylation to transcriptional control of lipid storage genes, with possible relevance to NAFLD-associated biology.

      In all, I find this to be an important paper that describes and advances a new regulatory pathway that has significance to human health and disease. It would also be of interest to a broad audience. That said, there are also some concerns that the authors should address, as outlined below.

      Major concerns (pertains to rigor - highest priority)

      (1) Overall, the work presented is of high quality, and the data nicely support the conclusions; however, a few panels that have missing controls or information should be strengthened:
      a. The co-IP panel in Figure 1B lacks a lane where HA SETD6 is expressed without PPARγ. This control is needed to verify that the SETD6-HA signal depends on PPARγ.
      b. In Figure 1C, the authors should show that the co-IP works in both directions (include IP for PPARγ/blot for SETD6). I am also a bit confused by the labeling, with IP on the left and on top of the panel next to the beads label. More importantly, the data would be stronger if the authors took advantage of a deletion line to validate that the co-IP is specific to the presence of both proteins.
      c. The same IP labeling issue exists for Figure 3B (the label is on the side and on top).
      d. Antibody information (e.g., where the pan-methyl Ab comes from and at what dilutions the antibodies are used) is missing.

      Nice to have experiments (medium priority - strongly consider)

      (2) A remaining gap is how K170me1 contributes to DNA binding and gene transcription. One possibility is that methylation enhances the DNA-binding activity of PPARγ. Given that the authors have all of the reagents, it would be possible to perform a gel shift assay (or other approach) with and without SETD6-mediated methylation. Is DNA binding affected/enhanced?

      (3) Along these lines, I wonder if there is another possibility: could SETD6-mediated methylation of PPARγ drive the SETD6-PPARγ interaction? In other words, in the K170R mutant, is SETD6 still associated with PPARγ, and is this interaction required for promoter recruitment? Alternatively, would a catalytically dead version of SETD6 fail to associate with PPARγ? Currently, no experiments test the impact of an unmethylatable version of PPARγ or a catalytically dead version of SETD6 on the SETD6-PPARγ interaction or SETD6 recruitment to promoters.

      Minor concerns (text and figure display)

      (4) The text has multiple typos and grammatical errors, and there are some issues with the figure display.

    3. Reviewer #2 (Public review):

      Summary:

      In this work, the authors investigated the regulation of the transcription factor PPARγ by the post-translational modification lysine methylation. The data demonstrate that the lysine methyltransferase SETD6 targets PPARγ for methylation using biochemical and cell-based assays. Methylation of PPARγ occurs in its DNA binding domain, and the authors demonstrate that loss of methylation limits PPARγ chromatin binding, particularly to lipid storage and metabolism gene promoters. As a physiological output, the authors demonstrate that deletion of SETD6 and loss of PPARγ methylation also disrupt lipid droplet accumulation in hepatocytes. In addition, the authors uncover a positive feedback loop in which SETD6 methylation of PPARγ also regulates its binding to the SETD6 promoter and expression of the gene.

      Strengths:

      One of the key strengths of this manuscript is the novelty of the findings in terms of identifying a new mode of regulation of PPARγ that modulates its chromatin association in cells and thereby regulates lipid metabolism genes. The authors nicely combine biochemical studies of SETD6 activity with cell-based assays investigating PPARγ and SETD6 function in regulating lipid storage. Data supporting this conclusion is largely convincing, and frequently, multiple assays are used to provide sufficient support to the conclusions. This work therefore expands regulatory modes of PPARγ and identifies a new target for SETD6, an enzyme that targets a number of other transcription factors. Furthermore, the regulatory loop that controls SETD6 expression via PPARγ methylation is likely important for understanding SETD6 function in different cell types that have high levels of lipid accumulation or regulation. The gene expression and lipid accumulation assays are useful for testing the physiological outcome of loss of SETD6 activity or PPARγ methylation directly.

      Weaknesses:

      The data presented in the manuscript are largely convincing in support of the authors' conclusions; however, there are some errors in the presentation of the figures and some issues in the text that would benefit from editing. Furthermore, there are some important questions not fully addressed in the results or discussion. It would be great if the authors could speculate more on the diverse roles of SETD6 in methylating transcription factors and/or provide more context regarding the conditions that are likely to support methylation of PPARγ by SETD6. Also, while a potential cross-talk between methylation and phosphorylation is described in the discussion, it would be great to provide more structural insight into how this might regulate DNA binding of PPARγ and/or discuss whether there are other possibilities given the location of the target lysine in the DNA binding domain.

    1. eLife Assessment

      In this useful manuscript, Yang et al attempt to show that platelet recruitment to the liver via macrophages contributes to APAP-induced liver injury, but there were many areas where the data supporting the conclusions were incomplete. For example, the idea that platelets only affected KC glycolysis, but not the metabolism of other cells, to mediate the phenotype after injury is not adequately supported by the evidence. It is recommended to perform additional experiments to strengthen the conclusions.

    2. Reviewer #1 (Public review):

      Summary:

      In this manuscript, Yang et al expand on their previous work showing that platelet recruitment to the liver via liver macrophages is important for APAP-induced liver injury. Here, they show that platelets induce a glycolytic switch in liver non-parenchymal cells, including Kupffer cells, and that this is mediated by the protein Aldolase A produced by platelet-derived extracellular vesicles (PEV). They show that targeting Aldolase A may be a valid therapeutic strategy for severe APAP injury.

      Strengths:

      (1) They nicely showed that platelet effects in APAP are mediated by Aldoa via platelet-derived extracellular vesicles.

      (2) Their data show that one of the effects of platelets in APAP liver injury is inducing a metabolic switch to the glycolytic pathway, including in KCs.

      (3) Their data point to the therapeutic potential of targeting ALDOA in severe APAP liver injury.

      Weaknesses:

      (1) They have not shown that the platelet-induced glycolytic switch is only in KCs.

      (2) They also have not shown that KC's role in APAP injury is primarily mediated by their interaction with platelets and the subsequent glycolytic switch.

    3. Reviewer #2 (Public review):

      Summary:

      In this manuscript, the authors have investigated the role of platelet-derived ALDOA in acetaminophen (APAP)-induced acute liver injury. There are some major flaws in data interpretation, as described below. While a decrease in liver injury due to platelet depletion and lower injury in platelet-specific ALDOA KO mice seems real, the claims related to EVs and platelet-KC crosstalk are not well supported.

      Strengths:

      Core findings are interesting and supported by the data.

      Weaknesses:

      (1) At least two additional timepoints, one at 6 hr and another at 24 hr should be performed in the APAP model to better understand the dynamics of liver injury, especially after platelet depletion.

      (2) Interpretation of the experiments in Figure 2 with clodronate is flawed. 2-DG pretreatment and CLDN administration alone both seem to decrease liver injury substantially, so it is not surprising to see very little injury in the 2-DG+CLDN group.

      (3) Since both 2-DG and CLDN were administered pre-APAP, it is possible that they may interfere with APAP metabolism. This should be checked by looking at GSH depletion at 30 min post APAP treatment. The same question goes for S2 figure data.

      (4) None of the studies include data on specific steps of APAP toxicity, such as GSH depletion, JNK activation, mitochondrial injury, etc., all of which are well characterized. Rather, only injury endpoints are measured. It is critical to measure the mechanistic steps. This applies to all studies, but most importantly to the ALDOA-PF-KO mice in Figure 6.

      (5) Interpretation of data in Figure 5F is flawed. Since depletion of platelets also decreases liver injury, it cannot be deduced that the decrease in ALDOA is only in platelets. Many other things are changing.

    4. Reviewer #3 (Public review):

      Summary:

      The authors address the possibility that platelet (PLT)-derived EVs are important mediators of acute liver injury. Furthermore, KCs are important mediators of inflammation and are noted to need to undergo metabolic reprogramming to achieve their effects during injury. They use an APAP-induced liver injury model (AILI). They show that PLTs are recruited and that they interact with KCs in this model system. RNA-seq of KCs showed upregulation of glycolysis and gluconeogenesis. PLT depletion led to reduced liver injury. RNA-seq of KCs showed downregulation of glycolysis. In vitro co-culture of KCs and PLTs recapitulated the glycolysis findings. In vivo, 2DG inhibited liver injury, but not in the setting of KC depletion. They went on to show that PLT-derived EVs mediate this effect on KCs using a mix of in vitro and in vivo assays, although control EVs were lacking. After doing mass spec on EVs, they find that ALDOA is the critical payload of the PEVs that mediates the pro-glycolytic effect in vivo. They both delete ALDOA from PLTs and use an ALDOA inhibitor to show that injury in AILI requires ALDOA.

      Strengths:

      This is generally an interesting series of observations with an elegant mechanism. Many of the experiments are done in vivo with highly rigorous KO models. However, in many of the EV experiments, there are concerns about a lack of appropriate controls that might limit the rigor of those aspects of the study. 

      Weaknesses:

      (1) There is strong variability in the gene expression between mice in Figure 1B. I worry that the signals may not be statistically significant. The authors should assess the statistical significance.

      (2) In Figure 2B, the necrosis areas that are circled in the image do not seem to resemble the quantitation on the right. For example, I don't see 60% necrosis in the APAP PBS group. Also, I don't see 5-10% necrosis in the CLDN APAP group. More images that are clearer are needed, and circled necrosis areas should be shown.

      (3) In Figure 2D, a higher N should be shown. The number of mice (3) is different from the other experiments, so the exclusion of those mice should be explained.

      (4) In general, control EVs from a non-PLT source should be used for all EV-related experiments. EVs derived from AML12 hepatocytes would seem to be a reasonable control for some of the experiments. Otherwise, it is hard to know if this is a general EV effect or one that is specific to PLT-derived EVs. In Figure 3B, EVs from non-PLTs should be used as a control, since it is possible that all EVs express some level of TSG101 or CD63. In addition, control EVs should be used to test effects on KC metabolism, since the claim is that the effects are specific to PLT-derived EVs. Similarly, Figure 4 needs some kind of EV control that is not from PLTs.

      (5) Figure 5B should include an EV control in the blot. Most of the blots need controls from AML12 EVs or from another in vivo source.

      (6) It is a little difficult to imagine how enough ALDOA protein could be transmitted from PEVs to influence KC glycolysis on the gene expression level. It is possible that ALDOA is required for PLT-induced activation of KCs, or that EVs from PLTs can induce a metabolic shift in KCs. However, it has not been definitively shown that ALDOA from PEVs is directly causing the KC activation. Ultimately, it would be good to obtain PEVs from ALDOA WT and KO mice, then provide these PEVs to AILI mice without PLTs to see if they have differential effects on the AILI model. This would really demonstrate that the ALDOA in the PEVs is mediating the glycolytic, injurious effect.

    1. Author response:

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      Kim and Parsons present a timely overview of the NTR/prodrug system and its applications in regenerative biology research, with particular emphasis on tissue-specific cell ablation. The system has substantially advanced the field by enabling non-invasive, conditional cell elimination, and has proven especially powerful in zebrafish, though applications in other classical model organisms are also noted. The review covers the historical origins of the NTR system, its use in regeneration studies, small molecule screening, and genetic and CRISPR-based screening, as well as future directions, including the development of the highly efficient NTR2 enzyme variant.

      Strengths:

      This is a useful and well-structured contribution. The manuscript is a valuable resource for the regeneration biology community.

      Weaknesses:

      The impact and scientific value of this paper could be meaningfully enhanced by addressing several points outlined below. The concerns centre on completeness, conceptual precision, and the depth of mechanistic discussion.

      (1) Title: Species specificity.

      Given that the review's primary focus is the zebrafish model, it would be appropriate to include the species name in the title. This would improve discoverability and accurately set the scope of the article for prospective readers.

      Thank you for this suggestion. In revising the review, we have substantially expanded the content to address the reviewers' comments, including adding more detail on the use of NTR in other species. We agree that the majority of published work, and the research we cover, has been conducted in zebrafish, and we have clarified this in the abstract and introduction. However, our aim in writing the review was also to highlight that there is no intrinsic barrier to adopting this technique more broadly in other systems. Notably, NTR was first developed in mice, but with a prodrug that proved difficult to use, and it was not widely pursued. In mouse models, the development of DTR offered an alternative, though that approach carries risks of kidney toxicity and is incompatible with chronic ablation due to immunogenicity. Given this context, we would prefer to retain a title that does not limit the scope exclusively to zebrafish, so as not to discourage readers working in other model systems who might benefit from considering the NTR system.

      (2) Subchapter: Physical injury.

      The subchapter enumerates different types of physical injury models but would benefit from a more substantive comparative discussion. In particular, the authors are encouraged to address the following:

      (2.1) Outcome comparison: Surgical and other invasive approaches cause damage to entire tissue structures comprising multiple cell types, whereas tissue-specific genetic ablation eliminates a defined cell population while leaving the surrounding architecture largely intact. This fundamental distinction has direct implications for the interpretation of regenerative outcomes and should be clearly articulated.

      We appreciate the reviewer raising these important points, as well as those noted in Section 2.2. We addressed the concerns from Sections 2.1 and 2.2 throughout multiple parts of our review, specifically in the following sections:

      • Physical injury – where we highlight the importance of precisely characterizing the nature and extent of tissue damage in order to appropriately interpret subsequent biological responses.

      • Chemogenetic cell-specific ablation – where we expand on this theme by discussing the advantages of selectively eliminating discrete cell populations and how this improves mechanistic interpretation of regeneration.

      • Development of NTR as a suicide gene – where we examine apoptotic pathways and their relevance to nitroreductase-mediated cell ablation.

      • NTR/prodrug systems in regenerative studies – where we compare what is currently known about immune activation and inflammatory responses across different NTR-based ablation paradigms.

      (2.2) Inflammatory response: Invasive injuries typically trigger a robust inflammatory response, which itself can be a potent driver of regeneration. By contrast, genetic cell ablation may elicit a qualitatively different inflammatory reaction. A comparative discussion of this distinction would help readers appreciate a critical limitation of genetic ablation systems relative to models of natural, accidental tissue damage.

      Please see our response to point 2.1 above.

      (3) Subchapter: Cell-specific toxins.

      This subchapter would benefit from several targeted expansions:

      (3.1) Off-target effects: The authors should include evidence that the exemplified drugs have known off-target activities, with a discussion of how these confounded the interpretation of experimental data. At least a few concrete published examples should be cited.

      Thank you very much for the comments. We have strengthened the discussion of off-target effects by adding concrete published examples. We now note that MPTP/MPP⁺ can affect noradrenergic and serotonergic systems in addition to dopaminergic neurons, that aminoglycoside antibiotics can damage support cells and afferent neurons at higher concentrations with compound-specific differences in ototoxicity, and that streptozotocin exhibits hepatotoxicity beyond pancreatic β-cells.

      (3.2) Completeness of the toxin list: The current list appears illustrative rather than comprehensive. A more complete enumeration would be valuable, particularly for neurotoxins and drugs targeting sensory cells, as these are highly relevant to the zebrafish regeneration field.

      We have now consolidated the toxins discussed throughout the review into Table 1, which includes additional entries alongside the previously listed agents. We have explicitly noted that this list is representative rather than exhaustive, as the full range of cell-specific toxins used across species is extensive.

      (3.3) Interspecies differences: It would be informative to specify whether drug specificity differs across species, as this is a practical consideration for researchers working in organisms other than zebrafish.

      We appreciate the reviewer’s question regarding potential interspecies differences in prodrug performance. Early work using NTR in mammals was conducted in mice, and all five published mouse studies relied exclusively on CB1954. No other NTR-activating prodrugs have been reported in mouse models, so direct comparisons are not available. Likewise, all published Xenopus studies used MTZ and thus do not provide internal comparisons across prodrugs. The Nematostella study employed NFP (citing rationale from a zebrafish study) and the approach yielded effective ablation.

      The only non-zebrafish study that directly compared prodrugs is the Drosophila work, which evaluated MTZ, RNZ, and NFP and reported lower activity for MTZ relative to the other compounds. Because it is not clear whether the authors were aware of the batch variability of MTZ or the need for freshly prepared solutions, interpreting this specific comparison is difficult.

      To address the reviewer’s comment, we have expanded the section on non-zebrafish organisms to clearly state which prodrug was successfully used in each species. However, given the limited number of studies, the absence of titration experiments, and the lack of standardized conditions across laboratories, we do not feel that the available evidence supports drawing conclusions about interspecies differences in prodrug performance.

      Consistent with our original discussion and based on the broader biochemical and empirical data available, we continue to recommend RNZ as the starting point for new experiments.

      (4) Subchapter: Optogenetic cell ablation.

      The authors note that optogenetic cell ablation has not yet been applied in conventional regeneration studies. It would strengthen this section to include a discussion of the underlying reasons for this gap, whether technical or biological, so that readers can appreciate the barriers and potential for future adoption.

      We thank the reviewer for this helpful suggestion. As recommended, we have added a concise, explicitly speculative statement discussing potential technical factors that may explain why optogenetic cell ablation has not yet been widely applied in regeneration studies. Specifically, we note that KillerRed-based ablation requires localized light delivery and ROS generation, making it best suited for discrete, optically accessible cells and less practical for targeting large or deep tissues. We also highlight that the dependence on microscopy-based illumination inherently limits throughput. This new text clarifies possible barriers to broader adoption while acknowledging that these points remain speculative.

      (5) Terminology: "Suicide gene".

      Applying the term "suicide gene" to nitroreductase is conceptually imprecise and merits reconsideration. Strictly speaking, a suicide gene is one whose expression alone is sufficient to kill the cell, as in the case of genes encoding direct triggers of apoptosis or the catalytic A subunit of diphtheria toxin (DTA). NTR does not meet this criterion: it requires the exogenous administration of a prodrug (e.g., metronidazole) to produce a cytotoxic metabolite and is therefore only conditionally lethal.

      It is worth noting that nitroreductases evolved in bacteria and fungi as enzymes involved in chemoprotection and detoxification, converting potentially toxic and mutagenic nitroaromatic compounds into less harmful metabolites (PMID: 18355273). This biological context further underscores that NTR is not inherently a lethal protein. The authors are encouraged to replace or qualify the term "suicide gene" and instead adopt terminology that more accurately reflects the conditional, prodrug-dependent nature of the system.

      We appreciate the reviewer’s thoughtful attention to terminology. We agree that, in its most classical and stringent sense, a suicide gene is one whose expression alone is sufficient to induce cell death. We also recognize that NTR does not meet this strict criterion.

      At the same time, we note that the term has broadened in contemporary usage, particularly within applied and translational contexts, to encompass prodrug-dependent systems. For example, the National Cancer Institute Thesaurus defines a suicide gene as “a gene which will cause a cell to kill itself, typically through interaction with a prodrug,” and Taber’s Medical Dictionary likewise states that it is “a gene that causes a cell to kill itself, usually by encoding an enzyme that converts a nontoxic prodrug into a toxic metabolite.” Under these widely used definitions, NTR is included within the scope of suicide gene systems.

      Nevertheless, we appreciate that terminology in this area is not universally standardized. To ensure clarity for all readers, we have added a brief definition in the revised manuscript explicitly noting the conditional, prodrug-dependent nature of NTR-mediated ablation. We are grateful to the reviewer for prompting this clarification.

      (6) NTR/MTZ in regenerative studies: Mechanistic depth.

      While the review catalogues several studies employing the NTR/MTZ system, it lacks mechanistic depth regarding the cellular basis of ablation. The following questions should be addressed, where evidence exists in the literature:

      (6.1) Temporal dynamics of cell death: What is known about the kinetics of NTR/MTZ induced lethality across different tissue types in larval and adult zebrafish, as well as other organisms? Are there age- and tissue-specific differences in the speed or completeness of ablation?

      Thank you for this important question. We have added text noting that the kinetics and completeness of NTR/prodrug-mediated ablation vary across experimental contexts, including with differences in NTR expression, enzyme/prodrug pairing, dose, cell type, and developmental stage. Published studies illustrate that the time course of ablation can differ substantially between models. Because most studies were designed to optimize ablation within individual tissues rather than for direct side-by-side comparison, the literature does not yet support broad quantitative conclusions about age- or tissue-specific differences across systems.

      (6.2) Mechanism of cell death: What is the cellular basis of NTR/MTZ-induced cytotoxicity in zebrafish? In particular, do the toxic metabolites preferentially cause mitochondrial damage or nuclear DNA damage, and what downstream death pathways are engaged?

      Thank you for the comments. We have added text discussing the mechanism of NTR/MTZ-induced cell death. We now note that NTR-mediated reduction of MTZ generates reactive intermediates that cause DNA damage and oxidative stress, with cell death occurring predominantly through apoptosis. We have also more strongly emphasized that in dopaminergic neurons, mitochondrial damage was identified as the primary cytotoxic mechanism. We acknowledge that the relative contribution of these pathways is likely to vary by cell type and remains an important area for future study.

      (6.3) Proliferative versus post-mitotic cells: Are proliferating and non-proliferating cells equally sensitive to the NTR/MTZ system, or does the proliferative status of a cell influence susceptibility? This is a practically important question for researchers designing ablation experiments in tissues with mixed cell populations.

      We appreciate the reviewer’s insightful question. We have now added a brief clarification to this section explaining that the NTR/MTZ system has been shown to act in a cell-cycle–independent manner, and both proliferating and post-mitotic cells can be ablated effectively.

      (6.4) Ablation of progenitor cells: Are there published examples demonstrating that co-ablation of differentiated functional cells and organ-specific progenitor cells abolishes regenerative capacity? Such examples would be highly informative in illustrating the system's power to dissect the cellular requirements for regeneration.

      To our knowledge, the zebrafish lateral line currently provides the clearest example in which NTR-mediated ablation of progenitor populations results in a loss of regenerative capacity. In this system, targeted ablation of support-cell progenitors severely reduces hair-cell regeneration, illustrating how NTR enables direct testing of cellular requirements for tissue repair.

      Addressing the points above, particularly the comparative discussion of injury models and inflammatory responses, the clarification of terminology, and the mechanistic discussion of NTR/MTZ-induced cell death, would substantially strengthen the review's scientific contribution and utility.

      Reviewer #2 (Public review):

      Summary:

      Kim and Parsons reviewed the nitroreductase (NTR)/prodrug system: when engineered cells expressing the enzyme NTR are treated with prodrug (e.g. metronidazole), NTR converts the prodrug into a cytotoxic compound that kills these cells. The review covers how the system has been developed, spatiotemporal control of targeted cell ablation, and its broad utility to study regenerative mechanisms, model human diseases, and screen chemicals to discover pro-regenerative and protective compounds. They further discussed the newer version of NTR, a more potent prodrug, and experimental design, which not only expands the possible utility of the NTR/prodrug system, but also allows the research community to develop a precise, reproducible and versatile platform.

      Strengths:

      The review summarized landmark applications of the NTR/prodrug system and recent studies, with a focus on the model organism zebrafish. The review provides a good gateway to understanding the system and considering regenerative studies.

      Weaknesses:

      No weaknesses were identified by this reviewer.

      Reviewer #3 (Public review):

      Summary:

      This manuscript by Kim and Parsons presents an overview of the nitroreductase/metronidazole (NTR/MTZ) cell ablation system.

      Strengths:

      This manuscript nicely places the NTR/MTZ system in the context of other cell ablation methods, with a discussion of their respective advantages and disadvantages. This review is particularly useful for highlighting the many ways the NTR/MTZ system has been applied to study the regeneration of multiple cell types and to model different degenerative human diseases. The review concludes with a discussion on recent improvements made to the system and practical considerations and "best practices" for NTR-based experiments. This review could be a helpful resource, especially for researchers new to regeneration or cell ablation studies.

      Weaknesses:

      Although the NTR/MTZ system has been used in other model organisms, this review is primarily focused on its uses in zebrafish. While this is understandable given the wide adoption of NTR/MTZ in the zebrafish field, discussion of the unique considerations and/or challenges for non-zebrafish systems would be an interesting addition and could broaden the potential audience for this review. Additional minor revisions, as suggested below, could also improve readability.

      Recommendations for the authors:

      Reviewer #2 (Recommendations for the authors):

      Since the lab mouse is an important mammalian model system, with certain tissues harbouring some regenerative capabilities, including the peripheral nervous system (e.g., sciatic nerve regeneration after crush), and myelin, etc., it would be great if a section could be included to discuss the potential adoption of the NTR/prodrug system in future mouse studies.

      We appreciate the reviewer’s suggestion to discuss the potential future use of the NTR/prodrug system in mouse models. In surveying the literature, we identified only five mouse studies employing NTR, all of which used CB1954. These early studies were conducted primarily as proof-of-principle work in the context of gene-directed enzyme prodrug therapy (GDEPT) for cancer, rather than for regenerative or lineage-specific ablation applications. We added this point to the text.

      Since those reports, we have not found additional examples of NTR use in mice. We do not know the precise reasons for this limited adoption, but it may reflect the availability of alternative ablation systems that are widely established in mouse research, such as the diphtheria toxin receptor (DTR) system.

      We agree that certain mouse tissues exhibit regenerative capacity and that targeted ablation tools can be valuable in such contexts. To address the reviewer’s point, we have added text noting the very limited historical use of NTR/CB1954 in mouse. We have no explanation as to why no one moved on to using NTR/MTZ in the mouse, but we note in two places in the text that DTR is the preferred method for mouse ablation experiments (even though DT does cause kidney damage and is incompatible with chronic studies!).

      Minor:

      (1) Lines 174-176: the sentence is repeated.

      (2) Figure 1, for the transgenic line, please be consistent with the line name in italics.

      Reviewer #3 (Recommendations for the authors):

      (1) In the abstract as well as in the main text, the authors note that the NTR/MTZ system has been used in multiple model systems. Yet, most of the review, and especially the practical advice given at the end, is very zebrafish-focused. Although this is understandable given the wide adoption of NTR/MTZ in the zebrafish field, the authors might consider revising the abstract to make it clearer that this review is primarily concerned with the use of the NTR/MTZ system in zebrafish.

      Thanks for the suggestion. We have changed the last half of the first paragraph of the abstract.

      That said, a brief discussion of any unique considerations and/or challenges for non-zebrafish systems would be an interesting addition and could broaden the potential audience for this review.

      Agreed. We have expanded the text in several places to discuss the NTR system in non-zebrafish models in more detail. In particular, we expanded our discussion of NTR in the mouse.

      (2) Line 176: There is a repetition of the sentence, "NTR/MTZ-mediated ablation has also been adapted for other model organisms."

      Found and deleted. Thank you!

      (3) Line 177: To improve clarity, the authors should include species names to prevent confusion. For example, both Xenopus laevis and Xenopus tropicalis are commonly used model organisms. Similarly, multiple Drosophila species are used by researchers.

      Added melanogaster and laevis to the text.

      (4) Can the authors address whether alternatives to MTZ (RNZ, etc.) have the same issues with batch-to-batch variability? That might be an important consideration for potential users. It would also be useful to include practical guidance for accounting for batch variability, for example, how to determine optimal prodrug concentrations, whether effective concentrations need to be determined for every batch/replicate/experiment, etc.

      We added text noting that it is not yet known whether RNZ exhibits batch-to-batch variability similar to MTZ, as this has not been systematically reported. Given the potential for variability, it would be prudent for researchers to titrate each new batch of RNZ or, alternatively, adopt a dosing strategy that exceeds the minimum effective concentration to ensure consistent ablation results.

      (5) For the last section ("Experimental design: Practical and technical considerations"), readability would be improved by applying a consistent bullet point format.

      Made the changes as requested.

      (6) Figure 1: Asterisks are not defined.

      The asterisks were meant to link two boxes depicting the same transgene without rewriting the name of the transgene. Clearly, this wasn’t clear, so we have added an explanation to the legend too.

      (7) Figure 3: Given that the schematics specify expression of NTR1 and NTR1.1, I assume this figure is adapted or based on previous published report(s). If so, the reference(s) should be noted in the figure legend or on the figure itself (as done for Figure 1). If the schematic is meant to depict only in general terms how binary expression vectors can be used, a more inclusive "NTR" label might be less confusing.

      Changed the figure legend and the figure.

      (8) Figure 4: To improve readability and accessibility, the authors should consider modifying panels C-N to use a more colorblind-friendly palette (e.g., green/magenta) or to present each channel as separate grayscale images.

    2. eLife Assessment

      This Review Article nicely synthesizes the development, applications, and recent technical advances of the nitroreductase/prodrug system, highlighting how it enables precise spatiotemporal cell ablation and experimental platforms for studying regenerative mechanisms and screening for pro-regenerative or protective compounds. Together, the article provides a conceptual and practical overview that will help researchers adopt and further develop this versatile approach in regenerative biology. It will be of interest to researchers studying regeneration, disease modelling, and targeted cell ablation, particularly those working with zebrafish and other genetic model systems.

    3. Reviewer #1 (Public review):

      Summary:

      Kim and Parsons present a timely overview of the NTR/prodrug system and its applications in regenerative biology research, with particular emphasis on tissue-specific cell ablation. The system has substantially advanced the field by enabling non-invasive, conditional cell elimination, and has proven especially powerful in zebrafish, though applications in other classical model organisms are also noted. The review covers the historical origins of the NTR system, its use in regeneration studies, small-molecule screening, and genetic and CRISPR-based screening, as well as future directions including the development of the highly efficient NTR2 enzyme variant.

      Strengths:

      This is a useful and well-structured contribution. The manuscript is a valuable resource for the regeneration biology community.

      Weaknesses:

      The revised manuscript shows significant improvements; however, two points remain insufficiently addressed and should be resolved in the final version.

      (1) The term 'suicide gene'

      As noted in my first round of revisions, the term 'suicide gene' as applied to bacterial nitroreductase remains unaddressed in the revised manuscript, despite being scientifically inappropriate and a potential source of confusion regarding the NTR/Mtz mechanism.

      'Suicide' implies an intrinsic, cell-autonomous programme of self-destruction. This is incompatible with the NTR/Mtz system, in which cell death is experimentally induced through exogenous administration of metronidazole (Mtz) by the investigator. While the 'suicide gene' framing may have utility in the cancer therapy literature, likely to aid communication with non-specialist and clinical audiences, it is not standard in the zebrafish field, where NTR is more accurately described as a conditional toxigene. Since this review focuses predominantly on zebrafish models, its terminology should reflect that of the relevant literature.

      A further conceptual problem with the 'suicide gene' framing is that it obscures the pharmacological nature of metronidazole. Mtz is a pharmaceutical agent with intrinsic baseline toxicity: extended exposure or modestly elevated concentrations cause toxic side effects and lethality even in non-transgenic (wild-type) zebrafish (PMID: 24428354). NTR-expressing cells do not self-destruct; rather, they are rendered selectively hypersensitive to Mtz relative to other eukaryotic cells by virtue of expressing the enzyme. This distinction is mechanistically important and should be reflected in the language used throughout the manuscript.

      In summary, the term 'suicide gene' does not accurately capture enzyme-mediated bioactivation of an exogenous prodrug and should be removed from the manuscript.

      (2) Barriers to using the NTR/Mtz system in non-aquatic model organisms

      In response to my suggestion that the title should include "zebrafish" to accurately convey the scope of the review to prospective readers, the authors stated that "there is no intrinsic barrier to adopting this technique more broadly in other systems," citing the example that "NTR was first developed in mice, but with a prodrug that proved difficult to use, and it was not widely pursued." These two statements are, however, contradictory: if the prodrug proved difficult to use, this constitutes precisely the kind of practical barrier the authors claim does not exist. The authors should clarify and reconcile this inconsistency, and provide a more thorough discussion of why the NTR/Mtz system has seen limited adoption in classical model organisms, such as mice and Drosophila.

    4. Reviewer #2 (Public review):

      Summary:

      Kim and Parsons reviewed the nitroreductase (NTR)/prodrug system: when engineered cells expressing the enzyme NTR are treated with prodrug (e.g. metronidazole), NTR converts the prodrug into a cytotoxic compound that kills these cells. The review covers how the system has been developed, spatiotemporal control of targeted cell ablation, and its broad utility to study regenerative mechanisms, model human diseases, and screen chemicals to discover pro-regenerative and protective compounds. They further discussed the newer version of NTR, a more potent prodrug, and experimental design, which not only expands the possible utility of the NTR/prodrug system, but also allows the research community to develop a precise, reproducible and versatile platform.

      Strengths:

      The review summarized landmark applications of the NTR/prodrug system and recent studies in model organisms, with a focus on zebrafish. The review provides a good gateway to understanding the system and considering regenerative studies.

      Weaknesses:

      None.

      Comments on revisions:

      The authors have addressed the previous points, and the manuscript has been greatly improved.

    5. Reviewer #3 (Public review):

      Summary:

      This manuscript by Kim and Parsons presents an overview of the nitroreductase/metronidazole (NTR/MTZ) cell ablation system.

      Strengths:

      This manuscript nicely places the NTR/MTZ system in the context of other cell ablation methods, with a discussion of their respective advantages and disadvantages. This review is particularly useful for highlighting the many ways the NTR/MTZ system has been applied to study the regeneration of multiple cell types and to model different degenerative human diseases. The review concludes with a discussion on recent improvements made to the system and practical considerations and "best practices" for NTR-based experiments. This review could be a helpful resource, especially for researchers new to regeneration or cell ablation studies.

      Comments on revised version:

      I thank the authors for revising the manuscript to expand their discussion of using the prodrug/NTR system in other model organisms while also revising the abstract to make it clear that this review is zebrafish-focused. With these revisions, this review provides an informative overview of how the prodrug/NTR system has been an important tool not only for regeneration studies but also for elevating the zebrafish as a regeneration model. That said, including other model organisms could have been a nice addition to the last section on experimental considerations, especially in the context of discussing potential barriers to wider adoption of the NTR system. However, given that the vast majority of studies using the NTR system are in zebrafish, the current scope of this review is understandable.